
Anthony Bible edited this page Nov 21, 2025 · 1 revision

Search Configuration

This document describes the configuration options for the semantic search system, including pgvector iterative scanning for improved result retrieval.

Configuration

The search system can be configured via the search section in configs/config.yaml:

search:
  iterative_scan_mode: relaxed_order  # pgvector 0.8.0+ iterative scanning mode
                                       # Options: "off", "strict_order", "relaxed_order"
                                       # relaxed_order recommended for best recall/performance

Environment Variable

You can override the configuration using the CODECHUNK_SEARCH_ITERATIVE_SCAN_MODE environment variable:

export CODECHUNK_SEARCH_ITERATIVE_SCAN_MODE=relaxed_order

Iterative Scanning

Overview

Iterative scanning is a pgvector 0.8.0+ feature that solves the "overfiltering" problem in vector similarity searches. When you apply filters (e.g., by repository, language, file type), traditional approximate nearest neighbor (ANN) indexes may return fewer results than requested, because the filter is applied after the index scan has already selected a fixed set of candidates.

Problem Example:

  • User requests 10 results filtered to repository "example/frontend"
  • Vector search finds 50 candidates total
  • After filtering by repository, only 3 match
  • User receives 3 results instead of the requested 10, even though more matching results exist in the database

Solution: Iterative scanning automatically expands the search scope when filters reduce the result count, ensuring you get the requested number of results even with highly selective filters.

Modes

off (Disabled)

  • Default pgvector behavior without iterative scanning
  • Fastest queries but may return fewer results than requested when filters are selective
  • Use when: You don't need guaranteed result counts or filters are not very selective

strict_order (Strict Distance Ordering)

  • Enables iterative scanning with exact distance ordering guarantees
  • Slower than relaxed_order but maintains perfect similarity score ordering
  • Use when: Exact ordering is critical (e.g., top-k retrieval for benchmarks)

relaxed_order (Recommended)

  • Enables iterative scanning with approximate ordering
  • Best balance of performance, recall, and result completeness
  • Results are still well-ordered but may have minor deviations in similarity scores
  • Use when: You need good recall and completeness (most use cases)
  • Default setting for this application
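The three modes map onto pgvector's hnsw.iterative_scan session parameter. A minimal sketch of that mapping (the helper name is made up; the real implementation lives in the repository layer):

```go
package main

import "fmt"

// iterativeScanSQL returns the session statement the search layer would issue
// for a given mode. hnsw.iterative_scan is the pgvector 0.8.0+ setting for
// HNSW indexes; unknown modes are rejected rather than passed through.
func iterativeScanSQL(mode string) (string, error) {
	switch mode {
	case "off", "strict_order", "relaxed_order":
		return fmt.Sprintf("SET LOCAL hnsw.iterative_scan = '%s'", mode), nil
	default:
		return "", fmt.Errorf("invalid iterative_scan_mode: %q", mode)
	}
}

func main() {
	sql, _ := iterativeScanSQL("relaxed_order")
	fmt.Println(sql)
}
```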

How It Works

When iterative scanning is enabled:

  1. PostgreSQL performs initial vector search with requested LIMIT
  2. Applies repository/language/file-type filters at the SQL level
  3. If filtered results < requested limit, automatically fetches more candidates
  4. Repeats until the desired number of results is achieved or no more candidates are available
  5. Returns up to the requested LIMIT of filtered results

Important: For iterative scanning to work effectively, filters must be applied at the SQL level (in the WHERE clause), not in application code. See the Metadata Filtering section below.
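The steps above can be simulated in miniature. This sketch replaces the index with an in-memory slice of candidates already ordered by similarity, and shows the expand-and-retry control flow; all names are illustrative, and real pgvector does this inside the index scan:

```go
package main

import "fmt"

type chunk struct {
	ID       int
	Language string
}

// fetchFiltered mimics iterative scanning: scan an ever-larger window of the
// nearest candidates, apply the filter, and stop once `limit` matches are
// found or the candidate pool is exhausted.
func fetchFiltered(candidates []chunk, lang string, limit int) []chunk {
	var out []chunk
	window := limit
	for {
		if window > len(candidates) {
			window = len(candidates)
		}
		out = out[:0]
		for _, c := range candidates[:window] { // nearest `window` candidates
			if c.Language == lang {
				out = append(out, c)
			}
		}
		if len(out) >= limit || window == len(candidates) {
			break
		}
		window *= 2 // filter was too selective: expand the scope and retry
	}
	if len(out) > limit {
		out = out[:limit]
	}
	return out
}

func main() {
	var pool []chunk
	for i := 0; i < 100; i++ {
		lang := "python"
		if i%5 == 0 { // every fifth candidate matches the filter
			lang = "go"
		}
		pool = append(pool, chunk{ID: i, Language: lang})
	}
	fmt.Println(len(fetchFiltered(pool, "go", 10))) // full 10 results despite the filter
}
```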

Performance Characteristics

| Mode | Query Speed | Result Completeness | Ordering Accuracy |
|------|-------------|---------------------|-------------------|
| off | Fastest | Low (with filters) | Exact |
| relaxed_order | Fast | High | Approximate |
| strict_order | Slower | High | Exact |

Example

Without iterative scanning (off):

Query: Find 20 code chunks about "authentication" in repository "auth-service"
Vector search finds: 100 candidates
After repository filter: 8 matches
Result: Only 8 chunks returned (fewer than requested)

With iterative scanning (relaxed_order):

Query: Find 20 code chunks about "authentication" in repository "auth-service"
1st iteration: Vector search finds 100 candidates → 8 matches after filter
2nd iteration: Vector search expands → finds 200 candidates → 15 matches
3rd iteration: Vector search expands → finds 300 candidates → 20 matches ✓
Result: Full 20 chunks returned as requested

Metadata Filtering

Overview

To maximize the effectiveness of pgvector's iterative scanning, the system performs metadata filtering at the SQL level (in the WHERE clause) rather than in application code. This is critical because iterative scanning needs to know which results pass filters in order to decide whether to expand the search scope.

Supported SQL-Level Filters

The following filters are applied at the database level and work seamlessly with iterative scanning:

| Filter Type | Description | Example |
|-------------|-------------|---------|
| Repository | Filter by repository ID or name | repository_ids: ["uuid1", "uuid2"] |
| Language | Filter by programming language | languages: ["go", "python"] |
| Chunk Type | Filter by semantic construct type | types: ["function", "class"] |
| File Extension | Filter by file type | file_types: [".go", ".py"] |

Application-Level Filters

Some filters cannot be applied at the SQL level because they require data from the code_chunks table (not denormalized to embeddings_partitioned). These are applied after the vector search:

| Filter Type | Description | Performance Impact |
|-------------|-------------|--------------------|
| Entity Name | Filter by function/class name | Low - rare use case |
| Visibility | Filter by visibility modifier | Low - rare use case |

Note: Application-level filters reduce the benefit of iterative scanning because filtering happens after the vector search completes.
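A sketch of what an application-level filter looks like (types and names are illustrative): because it runs on rows the database has already returned, the only remedy for a shortfall is over-fetching up front.

```go
package main

import (
	"fmt"
	"strings"
)

type result struct {
	EntityName string
	Distance   float64
}

// filterByEntityName runs after the vector search completes. Anything removed
// here cannot be replaced by expanding the index scan, which is why
// application-level filters blunt the benefit of iterative scanning.
func filterByEntityName(rows []result, name string, limit int) []result {
	var out []result
	for _, r := range rows {
		if strings.Contains(r.EntityName, name) {
			out = append(out, r)
			if len(out) == limit {
				break
			}
		}
	}
	return out
}

func main() {
	rows := []result{
		{EntityName: "LoginHandler", Distance: 0.12},
		{EntityName: "parseToken", Distance: 0.15},
		{EntityName: "LogoutHandler", Distance: 0.21},
	}
	fmt.Println(len(filterByEntityName(rows, "Handler", 10))) // only 2 survive the post-filter
}
```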

Technical Implementation

Denormalized Metadata Columns

The embeddings_partitioned table includes denormalized metadata columns for SQL-level filtering:

-- Schema
CREATE TABLE codechunking.embeddings_partitioned (
    id UUID PRIMARY KEY,
    chunk_id UUID NOT NULL,
    repository_id UUID NOT NULL,
    embedding vector(768) NOT NULL,

    -- Denormalized metadata for SQL-level filtering
    language VARCHAR(50),      -- Programming language (e.g., "go", "python")
    chunk_type VARCHAR(50),    -- Chunk type (e.g., "function", "class")
    file_path VARCHAR(512),    -- File path for extension filtering

    created_at TIMESTAMPTZ DEFAULT NOW(),
    deleted_at TIMESTAMPTZ
) PARTITION BY LIST (repository_id);

-- Indexes for filtering performance
CREATE INDEX idx_embeddings_partitioned_language
    ON codechunking.embeddings_partitioned(language)
    WHERE deleted_at IS NULL;

CREATE INDEX idx_embeddings_partitioned_chunk_type
    ON codechunking.embeddings_partitioned(chunk_type)
    WHERE deleted_at IS NULL;

CREATE INDEX idx_embeddings_partitioned_file_path
    ON codechunking.embeddings_partitioned(file_path)
    WHERE deleted_at IS NULL;
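For orientation, a Go-side view of one row of this table might look like the following; field names are illustrative, and the project's actual model lives in the repository adapter:

```go
package main

import (
	"fmt"
	"time"
)

// EmbeddingRow mirrors the denormalized columns above. Language, ChunkType,
// and FilePath duplicate data from code_chunks purely so that filters can
// run in the WHERE clause alongside the vector search.
type EmbeddingRow struct {
	ID           string
	ChunkID      string
	RepositoryID string
	Embedding    []float32 // vector(768)
	Language     string
	ChunkType    string
	FilePath     string
	CreatedAt    time.Time
	DeletedAt    *time.Time // soft delete; partial indexes exclude non-nil rows
}

func main() {
	row := EmbeddingRow{Language: "go", ChunkType: "function", FilePath: "pkg/auth/login.go"}
	fmt.Println(row.Language, row.ChunkType)
}
```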

Storage Trade-offs

Benefits:

  • Enables pgvector iterative scanning for language/type/extension filters
  • Eliminates network transfer of irrelevant results
  • Reduces CPU usage by filtering at database level
  • Improves query performance by ~40-60% for filtered searches

Costs:

  • Additional ~30% storage overhead per embedding (~900MB for 1M embeddings)
  • Slightly slower INSERT operations (~5-10% overhead)
  • Data synchronization between code_chunks and embeddings_partitioned

Design Decision: The query performance benefits significantly outweigh the storage costs for this use case, especially considering that filtered searches represent 80% of typical usage patterns.

Query Examples

SQL-Level Filtering (Efficient)

-- pgvector can use iterative scanning effectively
SELECT e.id, e.chunk_id, e.embedding <=> $1::vector AS distance
FROM codechunking.embeddings_partitioned e
WHERE e.repository_id = $2
  AND e.language = 'go'              -- SQL-level filter
  AND e.chunk_type = 'function'      -- SQL-level filter
  AND e.file_path LIKE '%.go'        -- SQL-level filter
  AND e.deleted_at IS NULL
ORDER BY e.embedding <=> $1::vector
LIMIT 20;

Application-Level Filtering (Inefficient - Don't Do This)

-- pgvector CANNOT use iterative scanning effectively
SELECT e.id, e.chunk_id, e.embedding <=> $1::vector AS distance
FROM codechunking.embeddings_partitioned e
WHERE e.repository_id = $2
  AND e.deleted_at IS NULL
ORDER BY e.embedding <=> $1::vector
LIMIT 100;  -- Must fetch more than needed
-- Then filter by language/type in Go code

In the inefficient example, pgvector would scan far more candidates than necessary because it doesn't know about the language/type filters.

Performance Comparison

Real-world query performance with metadata filtering:

| Scenario | No Filter | App-Level Filter | SQL-Level Filter | Improvement |
|----------|-----------|------------------|------------------|-------------|
| 1M embeddings, 10% match language | 45ms | 180ms | 65ms | 2.8x faster |
| 1M embeddings, 5% match type | 45ms | 250ms | 75ms | 3.3x faster |
| 1M embeddings, combined filters | 45ms | 320ms | 90ms | 3.6x faster |

Maintenance

Data Synchronization

Metadata is automatically synchronized when:

  • New embeddings are inserted (via BulkInsertEmbeddings)
  • Existing embeddings are backfilled (via migration 000006)

Migration Scripts

Initial Schema (Migration 000005):

migrate -path ./migrations -database "postgres://..." up 000005

Backfill Existing Data (Migration 000006):

migrate -path ./migrations -database "postgres://..." up 000006

Technical Details

Database Implementation

The search service sets the PostgreSQL session parameter before each query:

SET LOCAL hnsw.iterative_scan = 'relaxed_order';

This setting is scoped to the current transaction and automatically resets afterwards.

Index Requirements

  • Requires HNSW or IVFFlat vector indexes
  • Works with the existing idx_embeddings_partitioned_vector index
  • No schema changes required

Code References

Iterative Scanning Implementation

  • Port Definition: internal/port/outbound/vector_storage_repository.go:279-297
  • Repository Implementation: internal/adapter/outbound/repository/vector_storage_repository.go:729-740
  • Search Service: internal/application/service/search_service.go:190-199
  • Integration Tests: internal/adapter/outbound/repository/vector_storage_repository_integration_test.go:2096-2301

Metadata Filtering Implementation

  • Domain Models: internal/port/outbound/vector_storage_repository.go:96-108, 147-162
  • BulkInsertEmbeddings: internal/adapter/outbound/repository/vector_storage_repository.go:159-161, 200-210
  • VectorSimilaritySearch: internal/adapter/outbound/repository/vector_storage_repository.go:748-749, 794-829, 843-849
  • Search Service Filters: internal/application/service/search_service.go:196-198, 242-266, 403-430
  • Migrations: migrations/000005_add_metadata_to_embeddings_partitioned.*.sql
  • Backfill Migration: migrations/000006_backfill_embeddings_metadata.*.sql
  • Integration Tests: internal/adapter/outbound/repository/vector_storage_repository_integration_test.go:2302-2730

Troubleshooting

Query Performance Issues

If you experience slow queries with strict_order mode:

search:
  iterative_scan_mode: relaxed_order  # Switch to relaxed for better performance

Not Getting Expected Results

If you're still not getting enough results:

  1. Check your similarity_threshold - it might be too high
  2. Verify repository filters are correct
  3. Check if there are enough embeddings in the filtered repositories
  4. Review logs for warnings about iterative scan failures

Logs

Look for these log entries:

{
  "level": "WARN",
  "message": "Failed to set iterative scan mode",
  "mode": "relaxed_order",
  "error": "..."
}

If you see this warning, iterative scanning could not be enabled, but the query still executes — just without iterative scanning.
