
Model Manager

Centralized model broker and job runner for local and remote models. It exposes one service on port 5001 with two execution styles:

  • synchronous chat broker via /api/chat/completions
  • asynchronous queued jobs via /api/submit

What It Supports

  • Canonical routed model IDs such as ollama/qwen3:8b, openrouter/z-ai/glm-5.1, and openrouter/anthropic/claude-sonnet-4.6
  • Local Ollama text and vision models discovered from /api/tags
  • OpenRouter-backed remote models, including Anthropic and GLM routes
  • Z.AI-backed remote text/code/tool routes via the coding endpoint
  • Systems history telemetry for brokered chat requests and queued job execution

Quick Start

  1. Ensure Ollama is running on localhost:11434 for local models.
  2. Add any remote models you want to expose in config.local.yaml.
  3. Start the service:
systemctl --user start model-manager
  4. Call the broker:
curl -X POST http://localhost:5001/api/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "zai/glm-5",
    "messages": [{"role": "user", "content": "Reply with exactly OK"}]
  }'
  5. Or submit an async job:
curl -X POST http://localhost:5001/api/submit \
  -H "Content-Type: application/json" \
  -d '{"model": "ollama/qwen3:8b", "prompt": "Hello world"}'
  6. Poll the job (see the scripted loop below):
curl http://localhost:5001/api/job/<job_id>
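
For scripting, steps 5 and 6 combine into a submit-and-poll loop. This is a minimal sketch: the job_id and status field names and the queued/running states are assumptions about the response shape, so adjust them to what the API actually returns.

# Submit, capture the job ID, then poll until the job settles.
job_id=$(curl -s -X POST http://localhost:5001/api/submit \
  -H "Content-Type: application/json" \
  -d '{"model": "ollama/qwen3:8b", "prompt": "Hello world"}' \
  | python3 -c 'import json, sys; print(json.load(sys.stdin)["job_id"])')
# Assumed in-flight states: "queued" and "running".
while curl -s "http://localhost:5001/api/job/$job_id" \
    | grep -q '"status": *"\(queued\|running\)"'; do
  sleep 2
done
curl -s "http://localhost:5001/api/job/$job_id"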

Remote Model Configuration

Create config.local.yaml for site-specific settings. Example routed model setup:

REMOTE_MODELS:
  - name: "openrouter/anthropic/claude-sonnet-4.6"
    provider: "openrouter"
    upstream_model: "anthropic/claude-sonnet-4.6"
    capabilities: ["text"]
    family: "claude"
    parameter_size: "remote"
    metadata:
      label: "Claude Sonnet 4.6 via OpenRouter"
  - name: "zai/glm-5"
    provider: "zai"
    upstream_model: "glm-5"
    capabilities: ["text", "code", "tools"]
    family: "glm"
    parameter_size: "remote"
    metadata:
      label: "GLM-5 via Z.AI coding endpoint"
  - name: "openrouter/z-ai/glm-5"
    provider: "openrouter"
    upstream_model: "z-ai/glm-5"
    capabilities: ["text", "code", "tools"]
    family: "glm"
    parameter_size: "remote"
    metadata:
      label: "GLM-5 via OpenRouter"
  - name: "openrouter/z-ai/glm-5.1"
    provider: "openrouter"
    upstream_model: "z-ai/glm-5.1"
    capabilities: ["text", "code", "tools"]
    family: "glm"
    parameter_size: "remote"
    metadata:
      label: "GLM-5.1 via OpenRouter"

OPENROUTER_API_KEYS_JSON: "$SYSTEMS_ROOT/ai/modular-flow-engine/config/api_keys.json"
OPENROUTER_HTTP_REFERER: "https://localhost/model-manager"
OPENROUTER_X_TITLE: "model-manager"
OPENROUTER_TIMEOUT_SECONDS: 180

OpenRouter credentials can come from one of:

  • OPENROUTER_API_KEY in config.local.yaml
  • OPENROUTER_API_KEYS_JSON pointing at a JSON file with an openrouter key
  • OPENROUTER_API_KEY_FILE pointing at a plain-text key file
  • OPENROUTER_API_KEY in the environment

Z.AI credentials can come from one of:

  • ZAI_API_KEY in config.local.yaml
  • ZAI_API_KEYS_JSON pointing at a JSON file with a zai or glm key
  • ZAI_API_KEY_FILE pointing at a plain-text key file
  • ZAI_API_KEY in the environment
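
Both *_API_KEYS_JSON settings can point at the same file, as the example path above suggests. A minimal sketch of the assumed shape: the top-level names follow the bullets above, and the values are placeholders.

# Hypothetical shared keys file; real keys go where the placeholders are.
cat > api_keys.json <<'EOF'
{
  "openrouter": "YOUR_OPENROUTER_KEY",
  "zai": "YOUR_ZAI_KEY"
}
EOF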

The default bind host is 127.0.0.1. Override HTTP_HOST in config.local.yaml only if you intentionally want network exposure.
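
For example, to bind on all interfaces (a deliberate exposure decision, not a default):

# Assumes config.local.yaml sits in the service's working directory.
echo 'HTTP_HOST: "0.0.0.0"' >> config.local.yaml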

HTTP API

Endpoint               Method  Description
---------------------  ------  --------------------------------------------
/api/chat/completions  POST    Execute synchronous brokered chat completion
/api/submit            POST    Submit inference job
/api/job/<job_id>      GET     Get job status/result
/api/models            GET     List available models across providers
/api/models/refresh    POST    Refresh model list
/api/stats             GET     Queue and resource statistics
/api/health            GET     Health check
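
The read-only endpoints make a quick smoke test; these calls use only routes from the table above:

curl -s http://localhost:5001/api/health
curl -s http://localhost:5001/api/stats
curl -s http://localhost:5001/api/models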

Broker Chat

curl -X POST http://localhost:5001/api/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ollama/qwen3:8b",
    "messages": [{"role": "user", "content": "Reply with exactly OK"}]
  }'

Submit Job

curl -X POST http://localhost:5001/api/submit \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openrouter/anthropic/claude-sonnet-4.6",
    "prompt": "Answer in one line.",
    "system_prompt": "Be concise.",
    "priority": "high",
    "metadata": {"source": "manual-test"}
  }'

List Models

curl http://localhost:5001/api/models

Local models are exposed as canonical ollama/... routes. Configured remote models are exposed exactly as declared in REMOTE_MODELS; upstream_model defines the actual provider model ID behind each route.
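
The model-list response schema is not documented here, so as a rough check this greps the raw JSON for canonical local routes without assuming its exact layout:

# Hedged: pattern-match for ollama/ routes in whatever JSON comes back.
curl -s http://localhost:5001/api/models | grep -o '"ollama/[^"]*"' | sort -u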

CLI Tool

The ./cli tool provides commands for batch queries and vision analysis. It communicates with the running Model Manager service.

./cli batch-query items.txt "Is {item} a tool?"   # templated prompt per item (sample items.txt below)
./cli analyze photo.png                           # full vision analysis of an image
./cli quick photo.png "What color is the car?"    # one-off vision question
./cli count photo.png "people"                    # count matching objects in an image
./cli --version
./cli --print-defaults                            # show built-in defaults
./cli --print-resolved                            # show configuration after local overrides
./cli --print-config-schema                       # show the configuration schema
./cli --validate-config                           # validate the local configuration
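
The batch-query items file is presumably plain text with one item per line, each substituted into the {item} placeholder. A hypothetical items.txt:

# Hypothetical input file for the batch-query example above.
cat > items.txt <<'EOF'
hammer
banana
screwdriver
EOF
./cli batch-query items.txt "Is {item} a tool?"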

Architecture

  • HTTP API — synchronous broker plus async job queue on port 5001
  • Queue Manager — priority-based storage and batching
  • VRAM Scheduler — local-model load decisions based on GPU memory
  • Execution Engine — provider-aware inference execution
  • Model Registry — local discovery plus configured remote models
  • Resource Monitor — local GPU state via nvidia-smi and Ollama /api/ps

Remote models do not participate in local VRAM load/unload decisions. Synchronous broker requests to exact routed model IDs are executed immediately, while the async job API still goes through the queue and job-tracking path.
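
The Resource Monitor's two inputs can be inspected by hand with the same signals named above:

# GPU memory as nvidia-smi reports it.
nvidia-smi --query-gpu=memory.used,memory.total --format=csv
# Models currently loaded by Ollama.
curl -s http://localhost:11434/api/ps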

Development

cd "$SYSTEMS_ROOT/ai/model-manager"
PYTHONPATH=src python3 -m unittest discover -s tests
python3 -m compileall src tests

License

MIT License. See LICENSE.
