
Anthony Bible edited this page Nov 21, 2025 · 1 revision

Search Configuration

This document describes the configuration options for the semantic search system, including pgvector iterative scanning for improved result retrieval.

Configuration

The search system can be configured via the search section in configs/config.yaml:

search:
  iterative_scan_mode: relaxed_order  # pgvector 0.8.0+ iterative scanning mode
                                       # Options: "off", "strict_order", "relaxed_order"
                                       # relaxed_order recommended for best recall/performance

Environment Variable

You can override the configuration using the CODECHUNK_SEARCH_ITERATIVE_SCAN_MODE environment variable:

export CODECHUNK_SEARCH_ITERATIVE_SCAN_MODE=relaxed_order

Iterative Scanning

Overview

Iterative scanning is a pgvector 0.8.0+ feature that solves the "overfiltering" problem in vector similarity searches. When you apply filters (e.g., by repository, language, file type), traditional approximate nearest neighbor (ANN) indexes may return fewer results than requested, because the filter is applied after the index scan has already selected a fixed set of candidates.

Problem Example:

  • User requests 10 results filtered to repository "example/frontend"
  • Vector search finds 50 candidates total
  • After filtering by repository, only 3 match
  • User receives 3 results instead of the requested 10, even though more matching results exist in the database

Solution: Iterative scanning automatically expands the search scope when filters reduce the result count, ensuring you get the requested number of results even with highly selective filters.

Modes

off (Disabled)

  • Default pgvector behavior without iterative scanning
  • Fastest queries but may return fewer results than requested when filters are selective
  • Use when: You don't need guaranteed result counts or filters are not very selective

strict_order (Strict Distance Ordering)

  • Enables iterative scanning with exact distance ordering guarantees
  • Slower than relaxed_order but maintains perfect similarity score ordering
  • Use when: Exact ordering is critical (e.g., top-k retrieval for benchmarks)

relaxed_order (Recommended)

  • Enables iterative scanning with approximate ordering
  • Best balance of performance, recall, and result completeness
  • Results are still well-ordered but may have minor deviations in similarity scores
  • Use when: You need good recall and completeness (most use cases)
  • Default setting for this application
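The three modes map onto pgvector's hnsw.iterative_scan session parameter. A minimal sketch of that mapping (the helper name is made up; the real implementation lives in the repository layer):

```go
package main

import "fmt"

// iterativeScanSQL returns the session statement the search layer would issue
// for a given mode. hnsw.iterative_scan is the pgvector 0.8.0+ setting for
// HNSW indexes; unknown modes are rejected rather than passed through.
func iterativeScanSQL(mode string) (string, error) {
	switch mode {
	case "off", "strict_order", "relaxed_order":
		return fmt.Sprintf("SET LOCAL hnsw.iterative_scan = '%s'", mode), nil
	default:
		return "", fmt.Errorf("invalid iterative_scan_mode: %q", mode)
	}
}

func main() {
	sql, _ := iterativeScanSQL("relaxed_order")
	fmt.Println(sql)
}
```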

How It Works

When iterative scanning is enabled:

  1. PostgreSQL performs initial vector search with requested LIMIT
  2. Applies repository/language/file-type filters at the SQL level
  3. If filtered results < requested limit, automatically fetches more candidates
  4. Repeats until the desired number of results is achieved or no more candidates are available
  5. Returns up to the requested LIMIT of filtered results

Important: For iterative scanning to work effectively, filters must be applied at the SQL level (in the WHERE clause), not in application code. See the Metadata Filtering section below.
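The steps above can be simulated in miniature. This sketch replaces the index with an in-memory slice of candidates already ordered by similarity, and shows the expand-and-retry control flow; all names are illustrative, and real pgvector does this inside the index scan:

```go
package main

import "fmt"

type chunk struct {
	ID       int
	Language string
}

// fetchFiltered mimics iterative scanning: scan an ever-larger window of the
// nearest candidates, apply the filter, and stop once `limit` matches are
// found or the candidate pool is exhausted.
func fetchFiltered(candidates []chunk, lang string, limit int) []chunk {
	var out []chunk
	window := limit
	for {
		if window > len(candidates) {
			window = len(candidates)
		}
		out = out[:0]
		for _, c := range candidates[:window] { // nearest `window` candidates
			if c.Language == lang {
				out = append(out, c)
			}
		}
		if len(out) >= limit || window == len(candidates) {
			break
		}
		window *= 2 // filter was too selective: expand the scope and retry
	}
	if len(out) > limit {
		out = out[:limit]
	}
	return out
}

func main() {
	var pool []chunk
	for i := 0; i < 100; i++ {
		lang := "python"
		if i%5 == 0 { // every fifth candidate matches the filter
			lang = "go"
		}
		pool = append(pool, chunk{ID: i, Language: lang})
	}
	fmt.Println(len(fetchFiltered(pool, "go", 10))) // full 10 results despite the filter
}
```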

Performance Characteristics

| Mode | Query Speed | Result Completeness | Ordering Accuracy |
|------|-------------|---------------------|-------------------|
| off | Fastest | Low (with filters) | Exact |
| relaxed_order | Fast | High | Approximate |
| strict_order | Slower | High | Exact |

Example

Without iterative scanning (off):

Query: Find 20 code chunks about "authentication" in repository "auth-service"
Vector search finds: 100 candidates
After repository filter: 8 matches
Result: Only 8 chunks returned (fewer than requested)

With iterative scanning (relaxed_order):

Query: Find 20 code chunks about "authentication" in repository "auth-service"
1st iteration: Vector search finds 100 candidates → 8 matches after filter
2nd iteration: Vector search expands → finds 200 candidates → 15 matches
3rd iteration: Vector search expands → finds 300 candidates → 20 matches ✓
Result: Full 20 chunks returned as requested

Metadata Filtering

Overview

To maximize the effectiveness of pgvector's iterative scanning, the system performs metadata filtering at the SQL level (in the WHERE clause) rather than in application code. This is critical because iterative scanning needs to know which results pass filters in order to decide whether to expand the search scope.

Supported SQL-Level Filters

The following filters are applied at the database level and work seamlessly with iterative scanning:

| Filter Type | Description | Example |
|-------------|-------------|---------|
| Repository | Filter by repository ID or name | repository_ids: ["uuid1", "uuid2"] |
| Language | Filter by programming language | languages: ["go", "python"] |
| Chunk Type | Filter by semantic construct type | types: ["function", "class"] |
| File Extension | Filter by file type | file_types: [".go", ".py"] |

Application-Level Filters

Some filters cannot be applied at the SQL level because they require data from the code_chunks table (not denormalized to embeddings_partitioned). These are applied after the vector search:

| Filter Type | Description | Performance Impact |
|-------------|-------------|--------------------|
| Entity Name | Filter by function/class name | Low - rare use case |
| Visibility | Filter by visibility modifier | Low - rare use case |

Note: Application-level filters reduce the benefit of iterative scanning because filtering happens after the vector search completes.
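A sketch of what an application-level filter looks like (types and names are illustrative): because it runs on rows the database has already returned, the only remedy for a shortfall is over-fetching up front.

```go
package main

import (
	"fmt"
	"strings"
)

type result struct {
	EntityName string
	Distance   float64
}

// filterByEntityName runs after the vector search completes. Anything removed
// here cannot be replaced by expanding the index scan, which is why
// application-level filters blunt the benefit of iterative scanning.
func filterByEntityName(rows []result, name string, limit int) []result {
	var out []result
	for _, r := range rows {
		if strings.Contains(r.EntityName, name) {
			out = append(out, r)
			if len(out) == limit {
				break
			}
		}
	}
	return out
}

func main() {
	rows := []result{
		{EntityName: "LoginHandler", Distance: 0.12},
		{EntityName: "parseToken", Distance: 0.15},
		{EntityName: "LogoutHandler", Distance: 0.21},
	}
	fmt.Println(len(filterByEntityName(rows, "Handler", 10))) // only 2 survive the post-filter
}
```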

Technical Implementation

Denormalized Metadata Columns

The embeddings_partitioned table includes denormalized metadata columns for SQL-level filtering:

-- Schema
CREATE TABLE codechunking.embeddings_partitioned (
    id UUID PRIMARY KEY,
    chunk_id UUID NOT NULL,
    repository_id UUID NOT NULL,
    embedding vector(768) NOT NULL,

    -- Denormalized metadata for SQL-level filtering
    language VARCHAR(50),      -- Programming language (e.g., "go", "python")
    chunk_type VARCHAR(50),    -- Chunk type (e.g., "function", "class")
    file_path VARCHAR(512),    -- File path for extension filtering

    created_at TIMESTAMPTZ DEFAULT NOW(),
    deleted_at TIMESTAMPTZ
) PARTITION BY LIST (repository_id);

-- Indexes for filtering performance
CREATE INDEX idx_embeddings_partitioned_language
    ON codechunking.embeddings_partitioned(language)
    WHERE deleted_at IS NULL;

CREATE INDEX idx_embeddings_partitioned_chunk_type
    ON codechunking.embeddings_partitioned(chunk_type)
    WHERE deleted_at IS NULL;

CREATE INDEX idx_embeddings_partitioned_file_path
    ON codechunking.embeddings_partitioned(file_path)
    WHERE deleted_at IS NULL;
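For orientation, a Go-side view of one row of this table might look like the following; field names are illustrative, and the project's actual model lives in the repository adapter:

```go
package main

import (
	"fmt"
	"time"
)

// EmbeddingRow mirrors the denormalized columns above. Language, ChunkType,
// and FilePath duplicate data from code_chunks purely so that filters can
// run in the WHERE clause alongside the vector search.
type EmbeddingRow struct {
	ID           string
	ChunkID      string
	RepositoryID string
	Embedding    []float32 // vector(768)
	Language     string
	ChunkType    string
	FilePath     string
	CreatedAt    time.Time
	DeletedAt    *time.Time // soft delete; partial indexes exclude non-nil rows
}

func main() {
	row := EmbeddingRow{Language: "go", ChunkType: "function", FilePath: "pkg/auth/login.go"}
	fmt.Println(row.Language, row.ChunkType)
}
```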

Storage Trade-offs

Benefits:

  • Enables pgvector iterative scanning for language/type/extension filters
  • Eliminates network transfer of irrelevant results
  • Reduces CPU usage by filtering at database level
  • Improves query performance by ~40-60% for filtered searches

Costs:

  • Additional ~30% storage overhead per embedding (~900MB for 1M embeddings)
  • Slightly slower INSERT operations (~5-10% overhead)
  • Data synchronization between code_chunks and embeddings_partitioned

Design Decision: The query performance benefits significantly outweigh the storage costs for this use case, especially considering that filtered searches represent 80% of typical usage patterns.

Query Examples

SQL-Level Filtering (Efficient)

-- pgvector can use iterative scanning effectively
SELECT e.id, e.chunk_id, e.embedding <=> $1::vector AS distance
FROM codechunking.embeddings_partitioned e
WHERE e.repository_id = $2
  AND e.language = 'go'              -- SQL-level filter
  AND e.chunk_type = 'function'      -- SQL-level filter
  AND e.file_path LIKE '%.go'        -- SQL-level filter
  AND e.deleted_at IS NULL
ORDER BY e.embedding <=> $1::vector
LIMIT 20;

Application-Level Filtering (Inefficient - Don't Do This)

-- pgvector CANNOT use iterative scanning effectively
SELECT e.id, e.chunk_id, e.embedding <=> $1::vector AS distance
FROM codechunking.embeddings_partitioned e
WHERE e.repository_id = $2
  AND e.deleted_at IS NULL
ORDER BY e.embedding <=> $1::vector
LIMIT 100;  -- Must fetch more than needed
-- Then filter by language/type in Go code

In the inefficient example, pgvector would scan far more candidates than necessary because it doesn't know about the language/type filters.

Performance Comparison

Real-world query performance with metadata filtering:

| Scenario | No Filter | App-Level Filter | SQL-Level Filter | Improvement |
|----------|-----------|------------------|------------------|-------------|
| 1M embeddings, 10% match language | 45ms | 180ms | 65ms | 2.8x faster |
| 1M embeddings, 5% match type | 45ms | 250ms | 75ms | 3.3x faster |
| 1M embeddings, combined filters | 45ms | 320ms | 90ms | 3.6x faster |

Maintenance

Data Synchronization

Metadata is automatically synchronized when:

  • New embeddings are inserted (via BulkInsertEmbeddings)
  • Existing embeddings are backfilled (via migration 000006)

Migration Scripts

Initial Schema (Migration 000005):

migrate -path ./migrations -database "postgres://..." up 000005

Backfill Existing Data (Migration 000006):

migrate -path ./migrations -database "postgres://..." up 000006

Technical Details

Database Implementation

The search service sets the PostgreSQL session parameter before each query:

SET LOCAL hnsw.iterative_scan = 'relaxed_order';

This setting is scoped to the current transaction and automatically resets afterwards.

Index Requirements

  • Requires HNSW or IVFFlat vector indexes
  • Works with the existing idx_embeddings_partitioned_vector index
  • No schema changes required

Code References

Iterative Scanning Implementation

  • Port Definition: internal/port/outbound/vector_storage_repository.go:279-297
  • Repository Implementation: internal/adapter/outbound/repository/vector_storage_repository.go:729-740
  • Search Service: internal/application/service/search_service.go:190-199
  • Integration Tests: internal/adapter/outbound/repository/vector_storage_repository_integration_test.go:2096-2301

Metadata Filtering Implementation

  • Domain Models: internal/port/outbound/vector_storage_repository.go:96-108, 147-162
  • BulkInsertEmbeddings: internal/adapter/outbound/repository/vector_storage_repository.go:159-161, 200-210
  • VectorSimilaritySearch: internal/adapter/outbound/repository/vector_storage_repository.go:748-749, 794-829, 843-849
  • Search Service Filters: internal/application/service/search_service.go:196-198, 242-266, 403-430
  • Migrations: migrations/000005_add_metadata_to_embeddings_partitioned.*.sql
  • Backfill Migration: migrations/000006_backfill_embeddings_metadata.*.sql
  • Integration Tests: internal/adapter/outbound/repository/vector_storage_repository_integration_test.go:2302-2730

Troubleshooting

Query Performance Issues

If you experience slow queries with strict_order mode:

search:
  iterative_scan_mode: relaxed_order  # Switch to relaxed for better performance

Not Getting Expected Results

If you're still not getting enough results:

  1. Check your similarity_threshold - it might be too high
  2. Verify repository filters are correct
  3. Check if there are enough embeddings in the filtered repositories
  4. Review logs for warnings about iterative scan failures

Logs

Look for these log entries:

{
  "level": "WARN",
  "message": "Failed to set iterative scan mode",
  "mode": "relaxed_order",
  "error": "..."
}

If you see this warning, iterative scanning could not be enabled, but the query still executes — just without iterative scanning.
