GochoMugo/rag-backend

rag-backend

RAG backend service for document chunking and embeddings using transformer models.

Features

  • Chunk HTML documents into heading-based sections or plain text
  • Generate embeddings using the intfloat/e5-large-v2 transformer model
  • Async request handling with configurable concurrency

Endpoints

| Endpoint | Method | Purpose |
| --- | --- | --- |
| /chunks | POST | Chunk HTML or text documents |
| /embeddings | POST | Generate embeddings for passages/queries |
| /healthy | GET | Health check |

Environment Variables

| Variable | Default | Description |
| --- | --- | --- |
| BATCH_SIZE | 32 | Embedding batch size |
| CHUNK_MAX_TOKENS | 512 | Default max tokens per chunk |
| DEVICE | cuda or cpu | Compute device |
| HTTP_PORT | 8080 | HTTP server port |
| MAX_DOCUMENTS | 512 | Max documents per embedding request |
| MAX_EMBEDDING_CONCURRENCY | 0 | Max concurrent embedding requests (0 = unlimited) |
| MAX_LENGTH | 512 | Max token length per embedding |
| MODEL_CACHE_DIR | ./data/model | Model cache directory |
| MODEL_NAME | intfloat/e5-large-v2 | HuggingFace model for embeddings |
| WORDS_PER_TOKEN | 0.75 | Words-to-token ratio |
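
To get a feel for how WORDS_PER_TOKEN and CHUNK_MAX_TOKENS relate, here is a minimal sketch. It assumes the ratio is used as tokens ≈ words / WORDS_PER_TOKEN; how the service applies the value internally is not stated here, and the helper names `estimate_tokens` and `max_words_for` are hypothetical.

```python
import math
import os

# Defaults mirror the table above; the variables are read from the
# environment only for illustration.
WORDS_PER_TOKEN = float(os.environ.get("WORDS_PER_TOKEN", "0.75"))
CHUNK_MAX_TOKENS = int(os.environ.get("CHUNK_MAX_TOKENS", "512"))


def estimate_tokens(text: str, words_per_token: float = WORDS_PER_TOKEN) -> int:
    """Rough token estimate from a whitespace word count.

    Assumption: tokens ~= words / words_per_token. The real tokenizer
    count will differ.
    """
    words = len(text.split())
    return math.ceil(words / words_per_token)


def max_words_for(max_tokens: int, words_per_token: float = WORDS_PER_TOKEN) -> int:
    """Approximate number of words that fit in max_tokens tokens."""
    return math.floor(max_tokens * words_per_token)
```

Under this assumption and the default ratio, a 512-token chunk holds roughly 384 words.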

API Reference

POST /chunks

Chunk a document into smaller pieces.

Request:

{
  "document": "string (required)",
  "documentType": "html" | "text",
  "maxTokensPerChunk": "int (1-4096, optional)",
  "chunkOverlapPercent": "float (0-1, optional)"
}

Response:

{
  "chunks": [
    {
      "heading": "string",
      "content": ["string"],
      "children": []
    }
  ]
}
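
A client will often want a flat list of text pieces rather than a tree. The sketch below walks the response recursively and yields each content string together with its heading path. It assumes that entries in `children` have the same shape as top-level chunks, which the schema above suggests but does not state; `iter_sections` is a hypothetical helper name.

```python
from typing import Any, Dict, Iterator, List, Tuple

Chunk = Dict[str, Any]


def iter_sections(
    chunks: List[Chunk], path: Tuple[str, ...] = ()
) -> Iterator[Tuple[Tuple[str, ...], str]]:
    """Yield (heading_path, text) pairs in document order.

    Assumes each child chunk has the same heading/content/children keys
    as its parent; missing keys are treated as empty.
    """
    for chunk in chunks:
        heading_path = path + (chunk.get("heading", ""),)
        for text in chunk.get("content", []):
            yield heading_path, text
        # Descend into nested sections, carrying the heading path along.
        yield from iter_sections(chunk.get("children", []), heading_path)
```

Joining the heading path with the text (for example `" > ".join(path)`) is one way to keep section context when the pieces are embedded later.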

POST /embeddings

Generate embeddings for documents.

Request:

{
  "documents": ["string (required, 1-512 items)"],
  "documentType": "passage" | "query",
  "maxLength": "int (1-4096, optional)",
  "batchSize": "int (1-128, optional)"
}

Response:

{
  "embeddings": [[float]],
  "model": "string",
  "stats": { "secs": float }
}
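
Because a request accepts at most 512 documents, larger collections have to be split across several requests. The following sketch builds one request body per batch; `build_embedding_requests` is a hypothetical helper, and the limit of 512 is taken from the schema above (the server-side value is set by MAX_DOCUMENTS).

```python
from typing import Any, Dict, List


def build_embedding_requests(
    documents: List[str],
    document_type: str = "passage",
    max_documents: int = 512,
) -> List[Dict[str, Any]]:
    """Split documents into request bodies of at most max_documents items."""
    if document_type not in ("passage", "query"):
        raise ValueError("document_type must be 'passage' or 'query'")
    if not documents:
        raise ValueError("at least one document is required")
    if max_documents < 1:
        raise ValueError("max_documents must be positive")
    return [
        {
            "documents": documents[start:start + max_documents],
            "documentType": document_type,
        }
        for start in range(0, len(documents), max_documents)
    ]
```

Sending the bodies in order and concatenating the returned `embeddings` lists keeps each vector aligned with its source document.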

GET /healthy

Health check endpoint.

Response: {"ok": true}

Examples

# Health check
curl http://localhost:8080/healthy

# Chunk HTML document
curl -X POST http://localhost:8080/chunks \
  -H "Content-Type: application/json" \
  -d '{"document": "<h1>Title</h1><p>Content</p>", "documentType": "html"}'

# Chunk plain text
curl -X POST http://localhost:8080/chunks \
  -H "Content-Type: application/json" \
  -d '{"document": "Long text goes here...", "documentType": "text"}'

# Generate embeddings
curl -X POST http://localhost:8080/embeddings \
  -H "Content-Type: application/json" \
  -d '{"documents": ["The text to embed"], "documentType": "passage"}'
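
The same calls can be made from Python using only the standard library. This is a minimal sketch that assumes the service is listening on localhost:8080 as in the curl examples; `make_request` and `call` are hypothetical helper names, not part of this project.

```python
import json
import urllib.request
from typing import Any, Dict, Optional

BASE_URL = "http://localhost:8080"  # assumed local instance


def make_request(
    path: str, payload: Optional[Dict[str, Any]] = None
) -> urllib.request.Request:
    """Build a GET request (no payload) or a JSON POST request."""
    if payload is None:
        return urllib.request.Request(BASE_URL + path, method="GET")
    data = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + path,
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def call(
    path: str, payload: Optional[Dict[str, Any]] = None, timeout: float = 30.0
) -> Any:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(make_request(path, payload), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With the service running, `call("/healthy")` performs the health check and `call("/embeddings", {"documents": ["The text to embed"], "documentType": "passage"})` mirrors the last curl example.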

Run

# Development
uv run poe start

# Docker
docker run -p 8080:8080 ghcr.io/gochomugo/rag-backend:latest

Deployment

This service should be deployed behind a gateway (e.g., API Gateway, nginx, Traefik, Cloudflare) that provides:

  • Authentication - API key, OAuth, or similar access control
  • Rate limiting - Protect against abuse and ensure fair usage
  • Request size limits - Enforce maximum request body size to prevent memory exhaustion
  • Versioning - API versioning support (e.g., /v1/chunks, /v1/embeddings)
