DotNetVectorSearch

A comprehensive .NET solution for text embeddings and semantic similarity search using the E5 Multilingual model with ONNX Runtime.

Overview

This project provides a complete vector search implementation that includes:

Text embedding generation using the E5 Multilingual model
Semantic similarity search capabilities
SQLite database for storing embeddings
REST API for easy integration
Batch processing for dataset preparation

Project Structure

DotNetVectorSearch/
├── DotNetVectorSearch.Core/          # Core library with embedding services
│   ├── Embeddings/                   # Embedding service implementations
│   ├── RuntimeProvider/              # ONNX runtime providers
│   └── Onnx/                        # Model files (see setup requirements)
├── DotNetVectorSearch.Prepare/       # Dataset preparation tool
├── DotNetVectorSearch.WebAPI/        # REST API service
└── DotNetVectorSearch.sln           # Solution file

Prerequisites

.NET 9.0 SDK
Model files (see setup requirements below)

Setup Requirements

1. ONNX Model Files

⚠️ Important: You need to place the following files in the DotNetVectorSearch.Core/Onnx/ directory:

model_O4.onnx - The E5 Multilingual ONNX model file
sentencepiece.bpe.model - The SentencePiece tokenizer model

Download the model files from: https://huggingface.co/intfloat/multilingual-e5-small/tree/main/onnx

The embedding service requires both files. The application throws an exception if either one is missing.
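
A quick way to confirm the files are in place before building is a small check like the one below. This is only a sketch: FindMissingModelFiles is a hypothetical helper, not something the repository provides, and the relative path assumes you run it from the solution root.

```csharp
using System;
using System.IO;
using System.Linq;

// Hypothetical helper: returns the names of required files that are absent.
string[] FindMissingModelFiles(string onnxDirectory)
{
    string[] required = { "model_O4.onnx", "sentencepiece.bpe.model" };
    return required
        .Where(name => !File.Exists(Path.Combine(onnxDirectory, name)))
        .ToArray();
}

// Assumes the current directory is the solution root.
string[] missing = FindMissingModelFiles(Path.Combine("DotNetVectorSearch.Core", "Onnx"));

Console.WriteLine(missing.Length == 0
    ? "All model files are present."
    : "Missing: " + string.Join(", ", missing));
```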

2. Database File

The WebAPI project requires an SQLite database file named embeddings.db in the DotNetVectorSearch.WebAPI/ directory. Running the Prepare project generates this file and copies it there (see "Prepare the Dataset" under Getting Started).

Getting Started

1. Clone the Repository

git clone <repository-url>
cd DotNetVectorSearch

2. Add Required Model Files

Place the ONNX model files in the correct location:

DotNetVectorSearch.Core/Onnx/
├── model_O4.onnx
└── sentencepiece.bpe.model

3. Build the Solution

dotnet build

4. Prepare the Dataset (Optional)

If you have a dataset to process, place your dataset.csv file in the DotNetVectorSearch.Prepare/ directory and run:

cd DotNetVectorSearch.Prepare
dotnet run

This will:

Read the CSV dataset
Generate embeddings for each text entry
Create/update the embeddings.db SQLite database
Copy the database to the WebAPI project directory
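
To illustrate the storage step, here is one common way to pack an embedding into bytes for an SQLite BLOB column and read it back. This is an assumption about the general approach, not a description of the schema the Prepare project actually uses; ToBlob and FromBlob are hypothetical names.

```csharp
using System;
using System.Linq;
using System.Runtime.InteropServices;

// Hypothetical helpers: reinterpret a float vector as raw bytes and back.
// The bytes use the machine's native byte order, so the reader and writer
// must run on platforms with the same endianness.
byte[] ToBlob(float[] vector) => MemoryMarshal.AsBytes(vector.AsSpan()).ToArray();
float[] FromBlob(byte[] blob) => MemoryMarshal.Cast<byte, float>(blob.AsSpan()).ToArray();

float[] original = { 0.12f, -0.5f, 0.33f };   // sample values only
byte[] stored = ToBlob(original);              // 4 bytes per float
float[] restored = FromBlob(stored);

Console.WriteLine($"{original.Length} floats -> {stored.Length} bytes");
Console.WriteLine($"Round trip intact: {original.SequenceEqual(restored)}");
```

Storing vectors as raw bytes keeps rows compact and avoids parsing text on every search, at the cost of the values not being human-readable in the database.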

5. Run the Web API

cd DotNetVectorSearch.WebAPI
dotnet run

The API will be available at https://localhost:7000 (or the port specified in your launch settings).

API Endpoints

Generate Single Embedding

POST /api/embeddings
Content-Type: application/json

{
  "text": "Your text here"
}
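
As a rough client-side sketch, the request above could be sent from C# with HttpClient. The base address is the default mentioned earlier and may differ on your machine, RequestEmbeddingAsync is a hypothetical helper, and the response is printed as raw text because its shape is not documented here.

```csharp
using System;
using System.Net.Http;
using System.Net.Http.Json;
using System.Threading.Tasks;

// Hypothetical helper: posts {"text": "..."} and returns the raw response body.
async Task<string> RequestEmbeddingAsync(HttpClient http, string text)
{
    using HttpResponseMessage response =
        await http.PostAsJsonAsync("/api/embeddings", new { text });
    response.EnsureSuccessStatusCode();
    return await response.Content.ReadAsStringAsync();
}

// Assumed base address; adjust to the port in your launch settings.
using var client = new HttpClient { BaseAddress = new Uri("https://localhost:7000") };

try
{
    Console.WriteLine(await RequestEmbeddingAsync(client, "Your text here"));
}
catch (HttpRequestException ex)
{
    // Covers connection failures and non-success status codes.
    Console.WriteLine($"Request failed (is the API running?): {ex.Message}");
}
```

If the request fails with a certificate error on a development machine, trusting the ASP.NET Core development certificate with `dotnet dev-certs https --trust` is the usual fix.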

Generate Batch Embeddings

POST /api/embeddings/batch
Content-Type: application/json

{
  "texts": ["Text 1", "Text 2", "Text 3"]
}

Calculate Text Similarity

POST /api/similarity
Content-Type: application/json

{
  "text1": "First text",
  "text2": "Second text"
}
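
Similarity between two embeddings is typically measured with cosine similarity. The sketch below assumes that is what this endpoint computes, which the README does not state; CosineSimilarity is a hypothetical helper and the vectors are toy values, not real embeddings.

```csharp
using System;

// Hypothetical helper: cosine of the angle between two vectors, in [-1, 1].
double CosineSimilarity(float[] a, float[] b)
{
    if (a.Length != b.Length)
        throw new ArgumentException("Vectors must have the same length.");

    double dot = 0, normA = 0, normB = 0;
    for (int i = 0; i < a.Length; i++)
    {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }

    // A zero vector has no direction; treat its similarity as 0.
    if (normA == 0 || normB == 0) return 0;
    return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
}

Console.WriteLine(CosineSimilarity(new[] { 1f, 0f }, new[] { 1f, 0f })); // identical direction
Console.WriteLine(CosineSimilarity(new[] { 1f, 0f }, new[] { 0f, 1f })); // orthogonal
```

Identical directions score 1 and orthogonal vectors score 0, regardless of vector length.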

Search Similar Documents

POST /api/search
Content-Type: application/json

{
  "queryText": "Your search query",
  "topK": 10,
  "threshold": 0.7
}
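
The topK and threshold parameters suggest a filter-then-rank search. A minimal in-memory sketch of that idea follows, assuming all vectors are L2-normalized so the dot product equals cosine similarity. Search and Dot are hypothetical helpers and the documents are toy data; the actual service may query embeddings.db quite differently.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Dot product; equals cosine similarity when both vectors are L2-normalized.
double Dot(float[] a, float[] b)
{
    double sum = 0;
    for (int i = 0; i < a.Length; i++) sum += a[i] * b[i];
    return sum;
}

// Hypothetical search: drop results below the threshold, then keep the topK best.
List<(string Id, double Score)> Search(
    float[] query, IEnumerable<(string Id, float[] Vector)> documents, int topK, double threshold)
{
    return documents
        .Select(d => (d.Id, Score: Dot(query, d.Vector)))
        .Where(r => r.Score >= threshold)
        .OrderByDescending(r => r.Score)
        .Take(topK)
        .ToList();
}

// Toy unit-length vectors standing in for stored embeddings.
var corpus = new List<(string Id, float[] Vector)>
{
    ("doc-a", new[] { 1.0f, 0.0f }),
    ("doc-b", new[] { 0.8f, 0.6f }),
    ("doc-c", new[] { 0.0f, 1.0f }),
};

foreach (var hit in Search(new[] { 1.0f, 0.0f }, corpus, topK: 10, threshold: 0.7))
    Console.WriteLine($"{hit.Id}: {hit.Score:F2}");
```

With these values the search returns doc-a and then doc-b; doc-c falls below the 0.7 threshold and is dropped.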

Health Check

GET /health

Features

E5 Multilingual Support: Uses the multilingual-e5-small embedding model
ONNX Runtime: Optimized inference using ONNX Runtime
SQLite Storage: Embeddings are stored in and retrieved from a local SQLite database file (embeddings.db)
REST API: Easy integration with web applications
Batch Processing: Support for processing large datasets
Swagger Documentation: Interactive API documentation
CORS Support: Cross-origin resource sharing enabled
Health Checks: Built-in health monitoring

Configuration

API Configuration

The API can be configured through appsettings.json:

{
  "Logging": {
    "LogLevel": {
      "Default": "Information",
      "Microsoft.AspNetCore": "Warning"
    }
  },
  "AllowedHosts": "*"
}

Model Configuration

The embedding service is configured in the E5MultilingualEmbeddings class:

Maximum sequence length: 512 tokens
Model path: DotNetVectorSearch.Core/Onnx/model_O4.onnx
Tokenizer path: DotNetVectorSearch.Core/Onnx/sentencepiece.bpe.model
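
The 512-token limit means longer inputs have to be shortened before inference. The sketch below shows one way to do that, assuming token IDs are already available as a long array and the sequence ends with an end-of-sequence token that should be kept. TruncateTokens and the sample IDs are hypothetical, and the E5MultilingualEmbeddings class may handle truncation differently.

```csharp
using System;

const int MaxSequenceLength = 512; // limit stated above

// Hypothetical helper: keep the first maxLength - 1 tokens and re-append the last one.
long[] TruncateTokens(long[] tokenIds, int maxLength)
{
    if (tokenIds.Length <= maxLength) return tokenIds;

    var result = new long[maxLength];
    Array.Copy(tokenIds, result, maxLength - 1);
    result[maxLength - 1] = tokenIds[^1]; // preserve the end-of-sequence token
    return result;
}

// 600 made-up token IDs: 0 as start marker, 2 as end marker, 5 as filler.
var sample = new long[600];
Array.Fill(sample, 5L);
sample[0] = 0;
sample[^1] = 2;

long[] truncated = TruncateTokens(sample, MaxSequenceLength);
Console.WriteLine($"{sample.Length} -> {truncated.Length} tokens, last id = {truncated[^1]}");
// prints "600 -> 512 tokens, last id = 2"
```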

Development

Project Dependencies

DotNetVectorSearch.Core: Core embedding functionality
DotNetVectorSearch.Prepare: Depends on Core
DotNetVectorSearch.WebAPI: Depends on Core

Key Technologies

.NET 9.0
Microsoft.ML.OnnxRuntime
Microsoft.ML.Tokenizers
SQLite
ASP.NET Core Web API
Swagger/OpenAPI

Troubleshooting

Common Issues

Model files not found: Ensure model_O4.onnx and sentencepiece.bpe.model are in the DotNetVectorSearch.Core/Onnx/ directory
Database not found: Run the Prepare project first to generate the embeddings.db file
Memory issues: The ONNX model requires sufficient RAM; consider adjusting batch sizes for large datasets
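
For the memory issue in particular, splitting the input into fixed-size batches keeps peak usage bounded. A minimal sketch using Enumerable.Chunk (available since .NET 6) is shown below; the batch size of 4 and the sample texts are arbitrary, and the loop body stands in for whatever embedding call you use.

```csharp
using System;
using System.Linq;

// Ten sample texts; a real dataset would come from dataset.csv.
string[] texts = Enumerable.Range(1, 10).Select(i => $"Text {i}").ToArray();

const int BatchSize = 4; // arbitrary; lower it if memory is tight

foreach (string[] batch in texts.Chunk(BatchSize))
{
    // Placeholder for the embedding step: only one batch is held at a time.
    Console.WriteLine($"Embedding {batch.Length} texts: {string.Join(", ", batch)}");
}
```

Ten texts with a batch size of 4 produce batches of 4, 4, and 2.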

Logging

The application logs through Microsoft.Extensions.Logging. Check the console output for error messages and debugging information.

Contact

Author: PatrickChoDev
GitHub: https://github.com/PatrickChoDev

Acknowledgments

E5 Multilingual model for text embeddings
Microsoft ONNX Runtime team
.NET community
