PatrickChoDev/DotNetVectorSearch

DotNetVectorSearch

A comprehensive .NET solution for text embeddings and semantic similarity search using the E5 Multilingual model with ONNX Runtime.

Overview

This project provides a complete vector search implementation that includes:

  • Text embedding generation using E5 Multilingual model
  • Semantic similarity search capabilities
  • SQLite database for storing embeddings
  • REST API for easy integration
  • Batch processing for dataset preparation

Project Structure

DotNetVectorSearch/
├── DotNetVectorSearch.Core/          # Core library with embedding services
│   ├── Embeddings/                   # Embedding service implementations
│   ├── RuntimeProvider/              # ONNX runtime providers
│   └── Onnx/                        # Model files (see setup requirements)
├── DotNetVectorSearch.Prepare/       # Dataset preparation tool
├── DotNetVectorSearch.WebAPI/        # REST API service
└── DotNetVectorSearch.sln           # Solution file

Prerequisites

  • .NET 9.0 SDK
  • Model files (see setup requirements below)

Setup Requirements

1. ONNX Model Files

⚠️ Important: You need to place the following files in the DotNetVectorSearch.Core/Onnx/ directory:

  • model_O4.onnx - The E5 Multilingual ONNX model file
  • sentencepiece.bpe.model - The SentencePiece tokenizer model

Download the model files from: https://huggingface.co/intfloat/multilingual-e5-small/tree/main/onnx

The embedding service requires both files, and the application throws an exception if either one is missing.
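
To catch a missing file before the application does, a small pre-flight check can help. The sketch below is a standalone helper script and not part of the repository; the function name `missing_model_files` is hypothetical, while the file names and directory are the ones listed above.

```python
from pathlib import Path

# File names taken from the setup requirements above.
REQUIRED_FILES = ("model_O4.onnx", "sentencepiece.bpe.model")


def missing_model_files(onnx_dir):
    """Return the required file names that are absent from onnx_dir."""
    base = Path(onnx_dir)
    return [name for name in REQUIRED_FILES if not (base / name).is_file()]


if __name__ == "__main__":
    # Run from the repository root.
    missing = missing_model_files("DotNetVectorSearch.Core/Onnx")
    if missing:
        print("Missing model files:", ", ".join(missing))
    else:
        print("All model files are present.")
```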

2. Database File

The WebAPI project requires an SQLite database file named embeddings.db in the DotNetVectorSearch.WebAPI/ directory. This file is generated by running the Prepare project first.

Getting Started

1. Clone the Repository

git clone <repository-url>
cd DotNetVectorSearch

2. Add Required Model Files

Place the ONNX model files in the correct location:

DotNetVectorSearch.Core/Onnx/
├── model_O4.onnx
└── sentencepiece.bpe.model

3. Build the Solution

dotnet build

4. Prepare the Dataset (Optional)

Place your dataset.csv file in the DotNetVectorSearch.Prepare/ directory and run the commands below. The Web API needs the embeddings.db file that this step produces, so skip it only if that file already exists.

cd DotNetVectorSearch.Prepare
dotnet run

This will:

  • Read the CSV dataset
  • Generate embeddings for each text entry
  • Create/update the embeddings.db SQLite database
  • Copy the database to the WebAPI project directory
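
The Prepare tool is a .NET program, but the flow it follows can be sketched in a few lines of Python. Everything specific here is an assumption made for illustration: the `text` column name, the `embeddings` table schema, the float32 BLOB encoding, and the `fake_embed` placeholder that stands in for the real E5 model. The final copy step is left out.

```python
import csv
import io
import sqlite3
import struct


def fake_embed(text, dim=4):
    """Placeholder for the model: a deterministic toy vector, NOT an E5 embedding."""
    return [float((len(text) + i) % 7) for i in range(dim)]


def pack_vector(vec):
    """Serialize a list of floats as little-endian float32 bytes for a BLOB column."""
    return struct.pack(f"<{len(vec)}f", *vec)


def unpack_vector(blob):
    """Inverse of pack_vector."""
    return list(struct.unpack(f"<{len(blob) // 4}f", blob))


def prepare(csv_file, conn, embed=fake_embed):
    """Read rows, embed the (assumed) 'text' column, and store the results in SQLite."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS embeddings ("
        "id INTEGER PRIMARY KEY, text TEXT UNIQUE, vector BLOB)"
    )
    rows = [
        (row["text"], pack_vector(embed(row["text"])))
        for row in csv.DictReader(csv_file)
    ]
    # INSERT OR REPLACE makes re-runs update existing entries instead of failing.
    conn.executemany(
        "INSERT OR REPLACE INTO embeddings (text, vector) VALUES (?, ?)", rows
    )
    conn.commit()
    return len(rows)


if __name__ == "__main__":
    sample = io.StringIO("text\nhello world\nvector search\n")
    db = sqlite3.connect(":memory:")
    print(prepare(sample, db), "rows stored")
```

Storing vectors as raw float32 bytes keeps the database compact, but the actual schema used by the Prepare project may differ.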

5. Run the Web API

cd DotNetVectorSearch.WebAPI
dotnet run

The API will be available at https://localhost:7000 (or the port specified in your launch settings).

API Endpoints

Generate Single Embedding

POST /api/embeddings
Content-Type: application/json

{
  "text": "Your text here"
}
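
As a rough client-side sketch, the request above can be built with Python's standard library. The helper name `build_post` is hypothetical, and the base URL is the default mentioned earlier in this README. The response schema is not documented here, so the example builds the request and leaves sending it as a commented step.

```python
import json
import urllib.request

# Default from this README; adjust to the port in your launch settings.
BASE_URL = "https://localhost:7000"


def build_post(path, payload, base_url=BASE_URL):
    """Build (but do not send) a JSON POST request for one of the API endpoints."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url=base_url + path,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    req = build_post("/api/embeddings", {"text": "Your text here"})
    print(req.get_method(), req.full_url)
    # To send it against a running instance:
    #   with urllib.request.urlopen(req) as resp:
    #       print(json.load(resp))
    # Inspect the response before relying on any field names.
```

The same helper works for the other POST endpoints by changing the path and payload. A local development certificate may fail TLS verification unless it is trusted on your machine.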

Generate Batch Embeddings

POST /api/embeddings/batch
Content-Type: application/json

{
  "texts": ["Text 1", "Text 2", "Text 3"]
}

Calculate Text Similarity

POST /api/similarity
Content-Type: application/json

{
  "text1": "First text",
  "text2": "Second text"
}
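
This README does not state which similarity metric the endpoint uses. Cosine similarity is a common choice for E5 embeddings, so the sketch below assumes it. It is an illustration of the metric, not the service's actual implementation.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors, in [-1, 1]."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # A zero vector has no direction; treat it as unrelated to everything.
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```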

Search Similar Documents

POST /api/search
Content-Type: application/json

{
  "queryText": "Your search query",
  "topK": 10,
  "threshold": 0.7
}
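
One plausible reading of `topK` and `threshold` is a brute-force scan: score every stored vector against the query, drop scores below the threshold, and return the best `topK`. The sketch below assumes cosine scoring and an inclusive threshold. Both are guesses, and the function names are hypothetical.

```python
import heapq
import math


def _cosine(a, b):
    """Cosine similarity; returns 0.0 when either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query_vec, documents, top_k=10, threshold=0.7):
    """Return up to top_k (doc_id, score) pairs with score >= threshold, best first.

    documents is an iterable of (doc_id, vector) pairs.
    """
    scored = ((doc_id, _cosine(query_vec, vec)) for doc_id, vec in documents)
    kept = [(doc_id, score) for doc_id, score in scored if score >= threshold]
    return heapq.nlargest(top_k, kept, key=lambda item: item[1])


if __name__ == "__main__":
    docs = [("a", [1.0, 0.0]), ("b", [0.0, 1.0]), ("c", [1.0, 1.0])]
    for doc_id, score in search([1.0, 0.0], docs, top_k=2):
        print(doc_id, round(score, 3))
```

A linear scan like this is simple and works for small collections. Larger ones usually call for an approximate nearest-neighbor index.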

Health Check

GET /health

Features

  • E5 Multilingual Support: Uses the multilingual-e5-small embedding model
  • ONNX Runtime: Optimized inference using ONNX Runtime
  • SQLite Storage: Efficient storage and retrieval of embeddings
  • REST API: Easy integration with web applications
  • Batch Processing: Support for processing large datasets
  • Swagger Documentation: Interactive API documentation
  • CORS Support: Cross-origin resource sharing enabled
  • Health Checks: Built-in health monitoring

Configuration

API Configuration

The API can be configured through appsettings.json:

{
  "Logging": {
    "LogLevel": {
      "Default": "Information",
      "Microsoft.AspNetCore": "Warning"
    }
  },
  "AllowedHosts": "*"
}

Model Configuration

The embedding service is configured in the E5MultilingualEmbeddings class:

  • Maximum sequence length: 512 tokens
  • Model path: DotNetVectorSearch.Core/Onnx/model_O4.onnx
  • Tokenizer path: DotNetVectorSearch.Core/Onnx/sentencepiece.bpe.model
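
For background, the E5 model card derives a sentence embedding by averaging the token vectors selected by the attention mask and then L2-normalizing the result. Whether E5MultilingualEmbeddings follows exactly these steps is not stated here, so treat the following as a sketch of the general technique with hypothetical function names.

```python
import math


def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose attention-mask entry is 1."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vec):
                total[i] += value
    if count == 0:
        raise ValueError("attention mask selects no tokens")
    return [t / count for t in total]


def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(v * v for v in vec))
    return list(vec) if norm == 0.0 else [v / norm for v in vec]


if __name__ == "__main__":
    # Three token vectors; the last one is padding and is masked out.
    tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
    print(l2_normalize(mean_pool(tokens, [1, 1, 0])))
```

With unit-length embeddings, cosine similarity reduces to a plain dot product.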

Development

Project Dependencies

  • DotNetVectorSearch.Core: Core embedding functionality
  • DotNetVectorSearch.Prepare: Depends on Core
  • DotNetVectorSearch.WebAPI: Depends on Core

Key Technologies

  • .NET 9.0
  • Microsoft.ML.OnnxRuntime
  • Microsoft.ML.Tokenizers
  • SQLite
  • ASP.NET Core Web API
  • Swagger/OpenAPI

Troubleshooting

Common Issues

  1. Model files not found: Ensure model_O4.onnx and sentencepiece.bpe.model are in the DotNetVectorSearch.Core/Onnx/ directory
  2. Database not found: Run the Prepare project first to generate the embeddings.db file
  3. Memory issues: The ONNX model requires sufficient RAM; consider adjusting batch sizes for large datasets
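
For the memory issue in particular, one option on the client side is to split a large list of texts into smaller chunks and send one chunk per call to /api/embeddings/batch. The helper below is a hypothetical sketch. A suitable batch size depends on your hardware and is not specified by this project.

```python
def batched(items, batch_size):
    """Yield consecutive slices of items with at most batch_size elements each."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


if __name__ == "__main__":
    texts = ["Text 1", "Text 2", "Text 3", "Text 4", "Text 5"]
    for chunk in batched(texts, 2):
        # Each payload matches the batch endpoint's request shape.
        print({"texts": chunk})
```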

Logging

The application logs through Microsoft.Extensions.Logging. Check the console output for error messages and debugging information.

Acknowledgments

  • E5 Multilingual model for text embeddings
  • Microsoft ONNX Runtime team
  • .NET community

About

Demonstration for Vector Search Engine without cloud dependency @ .NET Meetup Thailand 2025
