
Embedding Fine-tuning with NVIDIA NeMo Microservices

Introduction

This guide shows how to fine-tune embedding models using the NVIDIA NeMo Microservices platform to improve performance on domain-specific tasks.

Why Fine-tune Embedding Models?

Retrieval quality determines AI application quality. Better retrieval means more accurate RAG (Retrieval Augmented Generation) responses, smarter agents, and more relevant search results. Embedding models power this retrieval by converting text into semantic vectors, but pre-trained models aren't optimized for your domain's specific vocabulary and context.

Fine-tuning adapts embedding models to your data (whether scientific literature, legal documents, or enterprise knowledge bases) to achieve measurably better retrieval performance. NeMo Microservices makes this practical by providing production-ready infrastructure that handles data preparation, training, deployment, and evaluation, letting you focus on improving models rather than building pipelines.

New to NeMo Microservices? Learn about Data Flywheel workflows in the main repository README or explore the NeMo Microservices documentation.

Embedding Fine-tuning workflow with NeMo Microservices

Figure 1: End-to-end workflow for fine-tuning embedding models using NeMo Microservices

The diagram above shows the embedding fine-tuning workflow that NeMo Microservices orchestrates:

Data Preparation: Download and format raw data locally into query-document triplets, then upload to the NeMo Data Store.
Fine-tuning: The NeMo Customizer service orchestrates training by launching a dedicated job that retrieves the base model and training data, performs supervised fine-tuning on GPU(s), and saves the fine-tuned weights to the Entity Store (model registry).
Deployment: The Deployment Management Service deploys the fine-tuned model as an NVIDIA Inference Microservice (NIM). It retrieves the model weights from the Entity Store and starts the NIM inference service.
Evaluation: The NeMo Evaluator service measures performance by querying the deployed NIM with benchmark tasks (such as Benchmarking Information Retrieval (BEIR) SciDocs) and calculating retrieval metrics like recall and NDCG.

This modular architecture enables each component to be independently scaled and managed.
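
To make the retrieval metrics named in the Evaluation step concrete, here is a minimal sketch of recall@k and NDCG@k in plain Python. It uses the common textbook definitions (linear gain, log2 discount); the exact implementation inside NeMo Evaluator is not described here, and the function names and document IDs are hypothetical.

```python
import math

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)

def ndcg_at_k(ranked_ids, relevance, k):
    """NDCG@k, with relevance given as a {doc_id: gain} mapping."""
    # Discounted cumulative gain of the ranking the model produced.
    dcg = sum(
        relevance.get(doc_id, 0) / math.log2(rank + 2)
        for rank, doc_id in enumerate(ranked_ids[:k])
    )
    # DCG of the best possible ranking, used for normalization.
    ideal_gains = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(gain / math.log2(rank + 2) for rank, gain in enumerate(ideal_gains))
    return dcg / idcg if idcg > 0 else 0.0

# Hypothetical ranking returned for one query, and its relevance labels.
ranked = ["d3", "d1", "d7", "d2"]
relevance = {"d1": 1, "d2": 1}

recall = recall_at_k(ranked, set(relevance), k=3)
ndcg = ndcg_at_k(ranked, relevance, k=3)
print(f"recall@3={recall:.3f} ndcg@3={ndcg:.3f}")
```

Recall only asks whether relevant documents made it into the top k, while NDCG also rewards placing them near the top, which is why the two are usually reported together.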

Objectives

This tutorial shows how to use the NeMo Microservices platform to fine-tune the nvidia/llama-3.2-nv-embedqa-1b-v2 embedding model using the SPECTER dataset, then evaluate its performance on the BEIR SciDocs benchmark against baseline metrics.

By the end of this tutorial, you will:

Fine-tune an embedding model on scientific domain data
Deploy the fine-tuned model as a NIM
Evaluate retrieval performance on the BEIR SciDocs benchmark
Compare results against baseline model metrics to demonstrate measurable improvement

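The tutorial covers the following steps, one notebook per step:

- 1_data_preparation.ipynb: data preparation
- 2_finetuning_and_inference.ipynb: fine-tuning and inference
- 3_evaluation.ipynb: evaluation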

About the SPECTER Dataset

The SPECTER dataset contains approximately 684K triplets from the scientific domain designed for training embedding models. Each triplet consists of:

Query: A paper title representing a search query
Positive: A related paper that should be retrieved (e.g., papers that cite each other)
Negative: An unrelated paper that should not be retrieved

Example triplet:

```
Query:    "Deep Residual Learning for Image Recognition"
Positive: "Identity Mappings in Deep Residual Networks"
Negative: "Attention Is All You Need"
```
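
As an illustration of how such a triplet might be serialized for upload, the sketch below writes one record as a JSON Lines entry. The field names `query`, `pos_doc`, and `neg_doc` and the helper `triplet_to_jsonl` are assumptions made for this example; check the data preparation notebook for the schema NeMo Customizer actually expects.

```python
import json

def triplet_to_jsonl(query, positive, negatives):
    """Serialize one training triplet as a single JSON Lines record."""
    record = {
        "query": query,              # assumed field name
        "pos_doc": positive,         # assumed field name
        "neg_doc": list(negatives),  # assumed field name; a list allows several negatives
    }
    # json.dumps escapes newlines inside strings, so the record stays on one line.
    return json.dumps(record, ensure_ascii=False)

line = triplet_to_jsonl(
    "Deep Residual Learning for Image Recognition",
    "Identity Mappings in Deep Residual Networks",
    ["Attention Is All You Need"],
)
parsed = json.loads(line)
print(line)
```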

During fine-tuning, the model learns through contrastive learning to maximize the similarity between the query and positive document while minimizing similarity with negative documents. This trains the model to produce embeddings that effectively capture semantic relationships in the scientific literature domain.
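
The contrastive objective can be sketched with toy vectors. The example below uses a generic InfoNCE-style loss over cosine similarities; this is one common formulation, not necessarily the loss NeMo Customizer applies, and the temperature value and the two-dimensional "embeddings" are made up for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def contrastive_loss(query, positive, negatives, temperature=0.05):
    """InfoNCE-style loss: -log softmax of the positive among all candidates."""
    pos_logit = cosine(query, positive) / temperature
    logits = [pos_logit] + [cosine(query, n) / temperature for n in negatives]
    # Log-sum-exp with max subtraction for numerical stability.
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    return log_sum_exp - pos_logit

query = [1.0, 0.0]
# Positive close to the query, negative far away: low loss.
good_loss = contrastive_loss(query, positive=[0.9, 0.1], negatives=[[0.0, 1.0]])
# Roles swapped: the "positive" is far from the query, so the loss is high.
bad_loss = contrastive_loss(query, positive=[0.0, 1.0], negatives=[[0.9, 0.1]])
print(f"good={good_loss:.6f} bad={bad_loss:.3f}")
```

Minimizing this loss pushes the query embedding toward the positive and away from the negatives, which is the behavior described above.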

Prerequisites

Deploy NeMo Microservices

To follow this tutorial, you will need at least two NVIDIA GPUs:

Fine-tuning: One GPU for fine-tuning the llama-3.2-nv-embedqa-1b-v2 model with NeMo Customizer.
Inference: One GPU for deploying the fine-tuned model as a NIM.

If you're new to NeMo Microservices, follow the Demo Cluster Setup on Minikube guide to get started. For production deployments, refer to the platform prerequisites and installation guide.

NOTE: Fine-tuning for embedding models is supported starting with NeMo Microservices version 25.8.0. Please ensure you deploy NeMo Microservices Helm chart version 25.8.0 or later to use these notebooks.

Register the Base Model

After deploying NeMo Microservices, register the llama-3.2-nv-embedqa-1b-v2 base model with NeMo Customizer:

```
helm upgrade nemo nmp/nemo-microservices-helm-chart --namespace default --reuse-values \
  --set customizer.customizationTargets.overrideExistingTargets=false \
  --set 'customizer.customizationTargets.targets.nvidia/llama-3\.2-nv-embedqa-1b@v2.enabled=true' && \
kubectl delete pod -n default -l app.kubernetes.io/name=nemo-customizer && \
kubectl wait --for=condition=ready pod -l app.kubernetes.io/name=nemo-customizer -n default --timeout=5m
```

This restarts the customizer to register the model (~2-3 minutes). The base checkpoint downloads from NGC on first use.

Client-Side Requirements

Ensure you have access to:

A Python-enabled machine capable of running Jupyter Lab.
Network access to the NeMo Microservices IP and ports.

Get Started

Create a virtual environment using uv (recommended for better dependency management):

```
# Install uv if not already installed
curl -LsSf https://astral.sh/uv/install.sh | sh

# Create and activate virtual environment
uv venv nemo_env
source nemo_env/bin/activate
```

Install the required Python packages using requirements.txt with uv:
```
uv pip install -r requirements.txt
```
Update the following variables in config.py with your specific URLs and API keys.

How to obtain the required values:
- NeMo Microservices URLs: If you followed the Demo Cluster Setup on Minikube guide, run cat /etc/hosts on your deployment machine to view the configured service hostnames and IP addresses.
- Hugging Face Token: Generate a token at https://huggingface.co/settings/tokens to download the SPECTER dataset.
```
# (Required) NeMo Microservices URLs
NDS_URL = "http://data-store.test" # Data Store
NEMO_URL = "http://nemo.test" # Customizer, Entity Store, Evaluator
NIM_URL = "http://nim.test" # NIM

# (Required) Hugging Face Token
HF_TOKEN = ""

# (Optional) To observe training with WandB
WANDB_API_KEY = ""
```
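
Before launching the notebooks, it can help to sanity-check these values. The sketch below is a hypothetical helper, not part of the repository: `check_config` only verifies that the URLs are well-formed and that the required token is set. It does not contact any service, and the sample dictionary simply mirrors the placeholder values shown above.

```python
from urllib.parse import urlparse

def check_config(settings):
    """Return a list of problems found in a config.py-style settings dict."""
    problems = []
    for name in ("NDS_URL", "NEMO_URL", "NIM_URL"):
        parsed = urlparse(settings.get(name, ""))
        # A usable service URL needs an http(s) scheme and a host.
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            problems.append(f"{name} is not a valid http(s) URL")
    if not settings.get("HF_TOKEN"):
        problems.append("HF_TOKEN is empty")
    return problems

# Placeholder values copied from the sample configuration.
sample = {
    "NDS_URL": "http://data-store.test",
    "NEMO_URL": "http://nemo.test",
    "NIM_URL": "http://nim.test",
    "HF_TOKEN": "",
}
problems = check_config(sample)
print(problems)
```
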
Launch Jupyter Lab to begin working with the provided tutorials:
```
uv run jupyter lab --ip 0.0.0.0 --port=8888 --allow-root
```
Navigate to the data preparation notebook to get started.
