rag-backend

Backend service for retrieval-augmented generation (RAG) that chunks documents and generates embeddings with transformer models.

Features

Chunk HTML documents into heading-based sections or plain text
Generate embeddings using the intfloat/e5-large-v2 transformer model (the default, configurable via `MODEL_NAME`)
Async request handling with configurable concurrency
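
To illustrate what "configurable concurrency" can mean in practice, here is a minimal sketch of a limiter that treats 0 as unlimited, mirroring the documented meaning of `MAX_EMBEDDING_CONCURRENCY`. It is not taken from this service's source: `make_limiter`, `demo`, and `fake_embed` are hypothetical names, and the sleep stands in for real embedding work.

```python
import asyncio
from contextlib import asynccontextmanager


def make_limiter(max_concurrency: int):
    """Return a factory for async context managers; 0 means no limit.

    Call this from inside a running event loop so the semaphore is
    created on the loop that will use it.
    """
    semaphore = asyncio.Semaphore(max_concurrency) if max_concurrency > 0 else None

    @asynccontextmanager
    async def limit():
        if semaphore is None:
            yield  # unlimited: nothing to acquire
            return
        async with semaphore:
            yield

    return limit


async def demo(max_concurrency: int, jobs: int = 6) -> int:
    """Run `jobs` fake embedding calls and return the peak number in flight."""
    limit = make_limiter(max_concurrency)
    running = 0
    peak = 0

    async def fake_embed() -> None:
        nonlocal running, peak
        async with limit():
            running += 1
            peak = max(peak, running)
            await asyncio.sleep(0.01)  # stand-in for model inference
            running -= 1

    await asyncio.gather(*(fake_embed() for _ in range(jobs)))
    return peak
```

Calling `asyncio.run(demo(2))` never lets more than two fake requests overlap, while `asyncio.run(demo(0))` lets all six run at once.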

Endpoints

Endpoint	Method	Purpose
`/chunks`	POST	Chunk HTML or text documents
`/embeddings`	POST	Generate embeddings for passages/queries
`/healthy`	GET	Health check

Environment Variables

Variable	Default	Description
`BATCH_SIZE`	`32`	Embedding batch size
`CHUNK_MAX_TOKENS`	`512`	Default max tokens per chunk
`DEVICE`	`cuda` or `cpu`	Compute device
`HTTP_PORT`	`8080`	HTTP server port
`MAX_DOCUMENTS`	`512`	Max documents per embedding request
`MAX_EMBEDDING_CONCURRENCY`	`0`	Max concurrent embedding requests (0=unlimited)
`MAX_LENGTH`	`512`	Max token length per embedding
`MODEL_CACHE_DIR`	`./data/model`	Model cache directory
`MODEL_NAME`	`intfloat/e5-large-v2`	HuggingFace model for embeddings
`WORDS_PER_TOKEN`	`0.75`	Words-to-token ratio
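
As a rough illustration of how the variables above map to typed settings, the sketch below reads them from an environment mapping and falls back to the documented defaults. The `Settings` class and `load_settings` function are hypothetical and not the service's actual configuration code. Because the table lists the `DEVICE` default only as `cuda` or `cpu`, the sketch leaves it as `None` when unset rather than guessing how the choice is made.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    batch_size: int
    chunk_max_tokens: int
    device: Optional[str]
    http_port: int
    max_documents: int
    max_embedding_concurrency: int
    max_length: int
    model_cache_dir: str
    model_name: str
    words_per_token: float


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from `env`, using the defaults from the table above."""
    return Settings(
        batch_size=int(env.get("BATCH_SIZE", "32")),
        chunk_max_tokens=int(env.get("CHUNK_MAX_TOKENS", "512")),
        device=env.get("DEVICE"),  # None: device selection is left to the service
        http_port=int(env.get("HTTP_PORT", "8080")),
        max_documents=int(env.get("MAX_DOCUMENTS", "512")),
        max_embedding_concurrency=int(env.get("MAX_EMBEDDING_CONCURRENCY", "0")),
        max_length=int(env.get("MAX_LENGTH", "512")),
        model_cache_dir=env.get("MODEL_CACHE_DIR", "./data/model"),
        model_name=env.get("MODEL_NAME", "intfloat/e5-large-v2"),
        words_per_token=float(env.get("WORDS_PER_TOKEN", "0.75")),
    )
```

Passing a plain dict, for example `load_settings({"HTTP_PORT": "9000"})`, makes it easy to check overrides without touching the real environment.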

API Reference

POST `/chunks`

Chunk a document into smaller pieces.

Request:

{
  "document": "string (required)",
  "documentType": "html" | "text",
  "maxTokensPerChunk": "int (1-4096, optional)",
  "chunkOverlapPercent": "float (0-1, optional)"
}

Response:

{
  "chunks": [
    {
      "heading": "string",
      "content": ["string"],
      "children": []
    }
  ]
}
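
The response is a tree, so clients usually need to walk it. The sketch below flattens it into (heading path, text) pairs. It assumes each entry in `children` has the same shape as a top-level chunk, which the schema above suggests but does not state. `iter_sections` and the `sample` response are hypothetical.

```python
from typing import Any, Dict, Iterator, List, Tuple

HeadingPath = Tuple[str, ...]


def iter_sections(
    chunks: List[Dict[str, Any]], path: HeadingPath = ()
) -> Iterator[Tuple[HeadingPath, str]]:
    """Yield (heading path, text) for every content string, depth first."""
    for chunk in chunks:
        here = path + (chunk.get("heading", ""),)
        for text in chunk.get("content", []):
            yield here, text
        # Assumption: children are chunks with the same three fields.
        yield from iter_sections(chunk.get("children", []), here)


# Hypothetical response body, shaped like the schema above.
sample = {
    "chunks": [
        {
            "heading": "Title",
            "content": ["Intro"],
            "children": [
                {"heading": "Sub", "content": ["A", "B"], "children": []}
            ],
        }
    ]
}

flat = list(iter_sections(sample["chunks"]))
```

Each pair keeps the full heading trail, which is convenient when the text is later passed to `/embeddings` and the heading context needs to travel with it.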

POST `/embeddings`

Generate embeddings for documents.

Request:

{
  "documents": ["string (required, 1-512 items)"],
  "documentType": "passage" | "query",
  "maxLength": "int (1-4096, optional)",
  "batchSize": "int (1-128, optional)"
}

Response:

{
  "embeddings": [[float]],
  "model": "string",
  "stats": { "secs": float }
}
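
Checking the documented limits on the client side can save a round trip. The helper below is a hypothetical sketch, not part of this service: `build_embeddings_request` is an invented name, the `"passage"` default is a convenience chosen here, and the 512-item cap reflects the default `MAX_DOCUMENTS`, which a deployment may change.

```python
from typing import Any, Dict, Optional, Sequence


def build_embeddings_request(
    documents: Sequence[str],
    document_type: str = "passage",
    max_length: Optional[int] = None,
    batch_size: Optional[int] = None,
    max_documents: int = 512,  # default MAX_DOCUMENTS; deployments may differ
) -> Dict[str, Any]:
    """Validate arguments against the documented ranges and build the JSON body."""
    if not 1 <= len(documents) <= max_documents:
        raise ValueError(f"documents must contain 1-{max_documents} items")
    if document_type not in ("passage", "query"):
        raise ValueError('documentType must be "passage" or "query"')

    payload: Dict[str, Any] = {
        "documents": list(documents),
        "documentType": document_type,
    }
    if max_length is not None:
        if not 1 <= max_length <= 4096:
            raise ValueError("maxLength must be in 1-4096")
        payload["maxLength"] = max_length
    if batch_size is not None:
        if not 1 <= batch_size <= 128:
            raise ValueError("batchSize must be in 1-128")
        payload["batchSize"] = batch_size
    return payload
```

Optional fields are omitted when not set, so the server-side defaults (`MAX_LENGTH`, `BATCH_SIZE`) still apply.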

GET `/healthy`

Health check endpoint.

Response: {"ok": true}

Examples

# Health check
curl http://localhost:8080/healthy

# Chunk HTML document
curl -X POST http://localhost:8080/chunks \
  -H "Content-Type: application/json" \
  -d '{"document": "<h1>Title</h1><p>Content</p>", "documentType": "html"}'

# Chunk plain text
curl -X POST http://localhost:8080/chunks \
  -H "Content-Type: application/json" \
  -d '{"document": "Long text goes here...", "documentType": "text"}'

# Generate embeddings
curl -X POST http://localhost:8080/embeddings \
  -H "Content-Type: application/json" \
  -d '{"documents": ["The text to embed"], "documentType": "passage"}'
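
The same calls can be made from Python with only the standard library. This is a minimal sketch that assumes the service is listening on `localhost:8080` as in the curl examples. `build_post` and `post_json` are hypothetical helper names.

```python
import json
import urllib.request
from typing import Any, Dict

BASE_URL = "http://localhost:8080"  # assumption: local instance on the default port


def build_post(path: str, payload: Dict[str, Any]) -> urllib.request.Request:
    """Create a JSON POST request for `path` without sending it."""
    return urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_json(path: str, payload: Dict[str, Any], timeout: float = 30.0) -> Any:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(build_post(path, payload), timeout=timeout) as resp:
        return json.load(resp)


# With the service running:
#   post_json("/chunks", {"document": "<h1>Title</h1><p>Content</p>", "documentType": "html"})
#   post_json("/embeddings", {"documents": ["The text to embed"], "documentType": "passage"})
```

Splitting request construction from sending keeps the first half easy to inspect or test without a running server.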

Run

# Development
uv run poe start

# Docker
docker run -p 8080:8080 ghcr.io/gochomugo/rag-backend:latest

Deployment

This service should be deployed behind a gateway (e.g., API Gateway, nginx, Traefik, Cloudflare) that provides:

Authentication - API key, OAuth, or similar access control
Rate limiting - Protect against abuse and ensure fair usage
Request size limits - Enforce maximum request body size to prevent memory exhaustion
Versioning - API versioning support (e.g., /v1/chunks, /v1/embeddings)
