prem-research/mlx-tinker

mlx-tinker

Proof-of-concept local Tinker backend for Apple Silicon that can actually keep learning. Run Qwen3.5 locally on a MacBook, plug it into an agent runtime, and do continual RL updates from real agent trajectories.

mlx-tinker implements the Tinker API on top of Apple's MLX framework. The interesting part is not just local inference: this repo now has working local continual-learning loops for both OpenClaw and Hermes Agent, with reward flowing back into local PPO-style updates on Apple Silicon.

Local Continual RL on a MacBook

In the validated OpenClaw setup below, WildClawBench task containers run OpenClaw, OpenClaw-RL scores the resulting trajectories, and mlx-tinker applies PPO updates locally on a MacBook.

  • The plotted run is an end-to-end local OpenClaw loop with tasks sourced from WildClawBench.
  • In that run, the system completed 32 PPO steps and scored 39 trajectories locally.
  • Reward moves off the floor and positive-reward steps start appearing in the back half of training.
  • The same local stack also supports live continual learning from real agent sessions.
  • This work was validated on an M4 MacBook Pro with 24GB unified memory.

[Figure: WildClawBench RL]

The currently validated OpenClaw-RL dependency is the fork branch ojus1/OpenClaw-RL@codex/qwen35-openclaw-tinker. mlx-tinker bootstraps that automatically in the managed OpenClaw path, so you do not need to wait for upstream PR timing to use the local-learning stack.

Multi-turn agent use is practical because mlx-tinker is not recomputing every long prompt from scratch on every turn. It uses disk-backed transcript prefix caching to offload reusable prompt/KV state locally, so repeated system prompts, tool schemas, and conversation prefixes can be restored instead of rebuilt. That is paired with quantized KV cache support for in-memory generation and gradient checkpointing for training-time memory savings, which is what makes longer agent sessions and local continual RL workable on a MacBook.
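
To make the prefix-caching idea concrete, here is a minimal sketch. It is not mlx-tinker's implementation: the `PrefixCache` class, the `block_size` parameter, the hashing scheme, and the stored placeholder state are all assumptions made for illustration. The point is only that a follow-up turn sharing a cached prefix needs to process its new suffix, not the whole transcript.

```python
import hashlib
import pickle
import tempfile
from pathlib import Path


def prefix_key(tokens):
    """Stable hash of a token prefix, used as the cache file name."""
    raw = ",".join(str(t) for t in tokens).encode("utf-8")
    return hashlib.sha256(raw).hexdigest()


class PrefixCache:
    """Hypothetical disk-backed cache mapping token prefixes to saved state."""

    def __init__(self, root, block_size=4):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)
        self.block_size = block_size

    def save(self, tokens, state):
        # Store state only at block boundaries so lookups stay cheap.
        usable = len(tokens) - len(tokens) % self.block_size
        if usable == 0:
            return 0
        (self.root / prefix_key(tokens[:usable])).write_bytes(pickle.dumps(state))
        return usable

    def longest_prefix(self, tokens):
        """Return (matched_length, state) for the longest cached prefix."""
        usable = len(tokens) - len(tokens) % self.block_size
        for end in range(usable, 0, -self.block_size):
            path = self.root / prefix_key(tokens[:end])
            if path.exists():
                return end, pickle.loads(path.read_bytes())
        return 0, None


cache = PrefixCache(tempfile.mkdtemp())
system_prompt = [1, 2, 3, 4, 5, 6, 7, 8]   # stand-in for a shared system prompt
cache.save(system_prompt, {"kv": "placeholder state for 8 tokens"})

turn_two = system_prompt + [9, 10, 11]      # same prefix, new suffix
matched, state = cache.longest_prefix(turn_two)
remaining = len(turn_two) - matched         # tokens that still need a forward pass
print(matched, remaining)
```

A real cache would store KV tensors rather than a dictionary and would enforce a disk budget, which is what a flag like --prefix-cache-disk-limit-gb suggests.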

Choose Your Integration

OpenClaw

This is the most complete path today. It has the best onboarding story and the most thorough validation.

OpenClaw Requirements And First Install

  • macOS with Apple Silicon
  • Python 3.12+
  • uv
  • git
  • node
  • Docker Desktop

Recommended models:

  • Qwen/Qwen3.5-4B on 24GB+ Macs
  • Qwen/Qwen3.5-0.8B on smaller-memory Macs

Install the repo:

git clone https://github.com/ojus1/mlx-tinker.git
cd mlx-tinker
uv sync

Then start the managed local-learning stack:

uv run python -m mlx_tinker openclaw setup --model Qwen/Qwen3.5-4B

The first run downloads the model, clones the required external repos into .external/ (OpenClaw and the supported OpenClaw-RL fork), and builds the local OpenClaw gateway image, so expect it to take a few minutes.

That one command starts three pieces:

  • native mlx-tinker inference + training backend
  • native OpenClaw-RL proxy/trainer
  • Dockerized OpenClaw gateway

It also patches OpenClaw to use the stable local model alias mlx-tinker-local/local-primary, installs the RL header plugin, and stores managed runtime state under ~/.openclaw/mlx-tinker/.

New OpenClaw Users

uv run python -m mlx_tinker openclaw setup --model Qwen/Qwen3.5-4B
uv run python -m mlx_tinker openclaw status

After setup:

  • OpenClaw is available on the local gateway port shown by status
  • the default model already points at the local learning backend
  • local webchat/CLI sessions stay inline instead of trying to route through outbound messaging tools

Existing OpenClaw Users

Run the same command:

uv run python -m mlx_tinker openclaw setup --model Qwen/Qwen3.5-4B

The managed setup preserves your existing OpenClaw installation:

  • it backs up the current ~/.openclaw/openclaw.json
  • it keeps your channels, other agent settings, and workspace defaults intact
  • it only switches the model/backend path over to the managed local-learning stack

OpenClaw Service Commands

uv run python -m mlx_tinker openclaw status
uv run python -m mlx_tinker openclaw logs --service all
uv run python -m mlx_tinker openclaw start
uv run python -m mlx_tinker openclaw stop

Hermes Agent

The Hermes Agent integration is a working proof of concept. It is not yet a one-command managed onboarding experience like OpenClaw.

Validated today:

  • live RL with Hermes Agent -> Hermes RL bridge -> mlx-tinker
  • new-user flow on a fresh HERMES_HOME
  • existing-user / resumed-session flow on a persisted HERMES_HOME
  • end-to-end local training on Qwen/Qwen3.5-2B

Not yet claimed:

  • end-to-end Hermes combine
  • end-to-end Hermes opd
  • polished Docker-managed Hermes onboarding

Hermes Requirements And First Install

  • macOS with Apple Silicon
  • Python 3.12+
  • uv
  • git

Install mlx-tinker:

git clone https://github.com/ojus1/mlx-tinker.git
cd mlx-tinker
uv sync

Clone the Hermes fork with the minimal live-RL header patch:

git clone https://github.com/ojus1/hermes-agent.git .external/hermes-agent
git -C .external/hermes-agent checkout codex/mlx-tinker-live-rl

For the currently validated Hermes PoC settings, start the local stack like this:

MODEL_NAME='Qwen/Qwen3.5-2B' \
HERMES_RL_MAX_CONTEXT_TOKENS=2048 \
HERMES_RL_LORA_RANK=8 \
bash scripts/run_hermes_rl.sh

That helper starts:

  • native mlx-tinker
  • the Hermes RL bridge at http://127.0.0.1:30050/v1
  • a local training loop listening for Hermes traffic

New Hermes Users

Point Hermes at a fresh home directory:

HERMES_HOME=~/hermes-poc-new \
HERMES_RL_ENABLED=1 \
HERMES_RL_PROXY_BASE_URL=http://127.0.0.1:30050/v1 \
uv run --directory .external/hermes-agent python run_agent.py \
  --model hermes-local \
  --base_url http://127.0.0.1:30050/v1 \
  --api_key hermes-local

Existing Hermes Users

Keep your current Hermes home and point Hermes at the bridge:

HERMES_HOME=~/.hermes \
HERMES_RL_ENABLED=1 \
HERMES_RL_PROXY_BASE_URL=http://127.0.0.1:30050/v1 \
uv run --directory .external/hermes-agent python run_agent.py \
  --model hermes-local \
  --base_url http://127.0.0.1:30050/v1 \
  --api_key hermes-local

The live-learning signal is header-based, not transcript scraping. Main Hermes model calls are tagged with X-Session-Id, X-Hermes-Outer-Turn-Id, X-Hermes-Step-Index, X-Turn-Type, and X-Hermes-Request-Id, which lets the bridge reconstruct multi-step tool-use trajectories, dedupe retries, and score turns against the next state.
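
As a rough illustration of how those headers could be used, the sketch below groups tagged requests into ordered steps per session and outer turn, and drops retries by request id. The header names come from the paragraph above; the `group_requests` function, the header values, and the grouping logic are assumptions for illustration, not the bridge's actual code.

```python
from collections import defaultdict


def group_requests(requests):
    """Group tagged requests into ordered steps per (session, outer turn)."""
    seen_request_ids = set()
    turns = defaultdict(list)
    for headers in requests:
        request_id = headers["X-Hermes-Request-Id"]
        if request_id in seen_request_ids:
            continue  # a retry of a request that was already recorded
        seen_request_ids.add(request_id)
        key = (headers["X-Session-Id"], headers["X-Hermes-Outer-Turn-Id"])
        turns[key].append(int(headers["X-Hermes-Step-Index"]))
    return {key: sorted(steps) for key, steps in turns.items()}


def tagged(session, turn, step, request_id):
    # Header values here are made up; only the header names come from the README.
    return {
        "X-Session-Id": session,
        "X-Hermes-Outer-Turn-Id": turn,
        "X-Hermes-Step-Index": str(step),
        "X-Turn-Type": "main",
        "X-Hermes-Request-Id": request_id,
    }


traffic = [
    tagged("s1", "t1", 1, "req-b"),
    tagged("s1", "t1", 0, "req-a"),
    tagged("s1", "t1", 1, "req-b"),  # retry, same request id
    tagged("s1", "t2", 0, "req-c"),
]
trajectories = group_requests(traffic)
print(trajectories)
```

Because the grouping key comes from headers rather than message text, two sessions with identical transcripts would still be kept apart.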

How To Tell Learning Is Actually Happening

Serving and training are separate processes, so a responsive model does not prove that training is running. Check the proxy/trainer log instead.

OpenClaw

uv run python -m mlx_tinker openclaw logs --service proxy

Look for lines like:

  • submitted session=...
  • drained 1 groups
  • forward_backward
  • optim_step

OpenClaw training records are written under:

  • ~/.openclaw/mlx-tinker/records/conversations.jsonl
  • ~/.openclaw/mlx-tinker/records/prm_scores.jsonl

Hermes Agent

tail -f /tmp/hermes_rl_bridge.log
tail -f /tmp/hermes_mlx_tinker.log

Look for the same progression:

  • submitted session=...
  • prm_eval_score=...
  • step 1: forward_backward
  • step 1: optim_step
  • training complete

One important nuance for both integrations: the current RL path scores a turn against the next state, so a single isolated one-turn chat will not train immediately. Once the same session gets a follow-up user turn, the previous turn can be scored and submitted into PPO.
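
The next-state rule can be sketched as follows. This is an illustration under assumptions, not the proxy's code: the `trainable_turns` function and the `user` / `assistant` record fields are hypothetical.

```python
def trainable_turns(session_turns):
    """Pair each turn with the user message that follows it.

    The last turn is left out because its next state does not exist yet.
    """
    return [
        (turn["assistant"], session_turns[i + 1]["user"])
        for i, turn in enumerate(session_turns[:-1])
    ]


one_turn = [{"user": "list the files", "assistant": "ran ls"}]
two_turns = one_turn + [{"user": "thanks, now open README", "assistant": "opened it"}]

ready_after_one = trainable_turns(one_turn)
ready_after_two = trainable_turns(two_turns)
print(ready_after_one)
print(ready_after_two)
```

With one turn nothing is ready to score. After the follow-up arrives, the first turn becomes scorable while the newest turn waits for its own follow-up.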

The loop is also mostly asynchronous. Inference stays live during batch collection, PRM scoring, forward_backward, and optim_step. The one deliberate pause is the weight swap: after an optimizer step, the proxy briefly pauses new submissions while it installs the updated sampling client, then resumes normal traffic.
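
One way to picture the weight-swap pause is a gate that stays open except while the new sampler is being installed. The sketch below uses `asyncio.Event` for that gate. `SubmissionGate`, `sampler_version`, and the sleep that stands in for the swap are all hypothetical; the actual proxy may coordinate this differently.

```python
import asyncio


class SubmissionGate:
    """Hypothetical gate: submissions wait only while weights are swapping."""

    def __init__(self):
        self._open = asyncio.Event()
        self._open.set()
        self.sampler_version = 0

    async def submit(self, item, log):
        await self._open.wait()        # blocks only during a swap
        log.append((item, self.sampler_version))

    async def swap_weights(self):
        self._open.clear()             # pause new submissions
        await asyncio.sleep(0.01)      # stand-in for installing the new sampler
        self.sampler_version += 1
        self._open.set()               # resume normal traffic


async def demo():
    gate, log = SubmissionGate(), []
    await gate.submit("before", log)
    swap = asyncio.create_task(gate.swap_weights())
    await asyncio.sleep(0)             # let the swap start and close the gate
    await gate.submit("during", log)   # waits until the swap finishes
    await swap
    return log


submission_log = asyncio.run(demo())
print(submission_log)
```

The submission made during the swap is not dropped. It is delayed and then recorded against the updated sampler version.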

Important Configs For Real Multi-Turn Use

If you want the local agent loop to feel good on longer sessions, these are the knobs that matter most:

  • --max-context-tokens on the RL side controls how much context each training datum keeps before truncation.
  • --prefix-cache-disk-limit-gb on mlx-tinker controls how much disk space is available for transcript prefix caching.
  • --kv-cache-bits and --kv-cache-group-size control KV-cache quantization for inference.
  • --quantized-kv-start controls when KV-cache quantization begins.
  • --checkpoints controls where LoRA checkpoints and prefix-cache artifacts are stored.
  • --max-batch-size and --cycle-ms are the backend scheduling knobs.

Current validated defaults:

  • OpenClaw managed path:
    • RL batch size: 1
    • RL max context tokens: 8192
    • gateway bind: lan
  • Hermes PoC path:
    • model: Qwen/Qwen3.5-2B
    • LoRA rank: 8
    • RL batch size: 1
    • RL max context tokens: 2048

Use mlx-tinker as a Plain Tinker Backend

If you do not want the OpenClaw stack and only want a local Tinker-compatible server, that path still works:

uv run python -m mlx_tinker --model Qwen/Qwen3.5-0.8B

Then point the normal Tinker SDK at localhost:

python -c "
import tinker
client = tinker.ServiceClient(base_url='http://localhost:8080', api_key='local')
print(client.healthz())
"

Tinker Compatibility is Real

The only change is base_url. Everything else — training loops, RL pipelines, checkpointing, sampling — is identical:

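import tinker

# These snippets use top-level await: run them inside an async function or a
# REPL started with `python -m asyncio`. train_data, prompt_tokens, and
# tokenizer are placeholders for objects from your own pipeline.

# Tinker (official):  client = tinker.ServiceClient()
# mlx-tinker:
client = tinker.ServiceClient(base_url="http://localhost:8080", api_key="local")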

# Create a QLoRA training client
training = await client.create_lora_training_client_async(
    base_model="Qwen/Qwen3.5-4B", rank=8
)

# SFT training loop
for batch in train_data:
    await training.forward_backward_async(batch, loss_fn="cross_entropy")
    await training.optim_step_async(tinker.AdamParams(learning_rate=1e-4))

# Get a sampling client from the trained model
sampler = await training.save_weights_and_get_sampling_client_async()
response = await sampler.sample_async(
    prompt=tinker.ModelInput.from_ints(prompt_tokens),
    num_samples=1,
    sampling_params=tinker.SamplingParams(temperature=0.0, max_tokens=128),
)
print(tokenizer.decode(response.sequences[0].tokens))

RL works too — importance sampling, PPO, the full loop:

# RL: sample rollouts, compute rewards, train with advantages
# (prompt, compute_reward, and build_rl_datum are placeholders for your own code)
sampler = await training.save_weights_and_get_sampling_client_async()
rollouts = await sampler.sample_async(prompt, num_samples=8,
    sampling_params=tinker.SamplingParams(temperature=0.8, max_tokens=128))

rewards = [compute_reward(seq) for seq in rollouts.sequences]
baseline = sum(rewards) / len(rewards)  # group-mean baseline, computed once
advantages = [r - baseline for r in rewards]

rl_batch = [build_rl_datum(seq, advantage) for seq, advantage in zip(rollouts.sequences, advantages)]
await training.forward_backward_async(rl_batch, loss_fn="importance_sampling")
await training.optim_step_async(tinker.AdamParams(learning_rate=5e-5))
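
The advantage step in the RL snippet subtracts the group-mean reward from each rollout's reward. A small standalone sketch with made-up rewards shows what that produces. The `mean_centered_advantages` helper name is an assumption, not part of the Tinker SDK.

```python
def mean_centered_advantages(rewards):
    """Subtract the group-mean reward so advantages sum to zero."""
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]


# Made-up rewards for 8 rollouts of one prompt: two successes, six failures.
rewards = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
advantages = mean_centered_advantages(rewards)
print(advantages)

# If every rollout gets the same reward, all advantages are zero.
flat = mean_centered_advantages([1.0, 1.0, 1.0, 1.0])
print(flat)
```

The second case is worth keeping in mind: a group where every rollout scores the same contributes no learning signal under this baseline.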

Benchmark: Tinker (official) vs mlx-tinker

50-step SFT on WikiSQL with QLoRA (rank-8, 4-bit quantization, batch_size=2):

[Figure: SFT Benchmark]

Metric                    Tinker (official) 4B   mlx-tinker 4B (M4 MBP 24GB)   mlx-tinker 9B (M4 MBP 24GB)
Initial loss              22.34                  24.23                         17.90
Final loss                0.07                   0.43                          0.01
Avg step time             3.7s                   5.3s                          8.7s
Total train time          184s                   267s                          436s
Post-train eval accuracy  33%                    30%                           29%

Both backends converge on WikiSQL SFT and reach comparable post-train eval accuracy (29% to 33%). mlx-tinker trades some per-step speed (5.3s vs 3.7s per step on the 4B model) for running entirely on your Mac: no cloud costs, no network latency, and your data stays local. Tinker (official) does not support the 9B model, so mlx-tinker also lets you train a larger model than the cloud backend offers.
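
To put the speed trade-off in numbers, the short calculation below uses only the figures reported in the benchmark. It derives the per-step slowdown relative to Tinker (official) and checks that average step time multiplied by 50 steps is close to the reported total.

```python
steps = 50
runs = {
    "Tinker (official) 4B": {"avg_step_s": 3.7, "total_s": 184},
    "mlx-tinker 4B": {"avg_step_s": 5.3, "total_s": 267},
    "mlx-tinker 9B": {"avg_step_s": 8.7, "total_s": 436},
}

baseline = runs["Tinker (official) 4B"]["avg_step_s"]
slowdown = {name: run["avg_step_s"] / baseline for name, run in runs.items()}
implied_total = {name: run["avg_step_s"] * steps for name, run in runs.items()}

for name, run in runs.items():
    print(
        f"{name}: {slowdown[name]:.2f}x step time, "
        f"~{implied_total[name]:.0f}s implied vs {run['total_s']}s reported"
    )
```

The small gaps between implied and reported totals are consistent with the step times being rounded to one decimal place.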

Features

QLoRA with Gradient Checkpointing

4-bit quantized base model with LoRA adapters. Gradient checkpointing recomputes activations during the backward pass instead of storing them, which trades extra compute for lower peak memory. All testing and development was done on an M4 MacBook Pro 24GB.

Five Loss Functions

Loss                 Use Case           Formula
cross_entropy        SFT                (-logp * w).sum()
importance_sampling  Off-policy RL      -(ratio * adv).sum()
ppo                  PPO-clip           -min(ratio*adv, clip(ratio)*adv).sum()
cispo                Conservative IS    -(sg(clip(ratio)) * logp * adv).sum()
dro                  Direct Reward Opt  -(logp*adv - 0.5*beta*(logp-old_lp)^2).sum()

All losses use sum reduction to match Tinker's official formulas.
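
The formulas in the table can be read as plain arithmetic over per-token values. The sketch below computes only their forward values in pure Python so the terms are easy to compare. It is not the MLX implementation: the `losses` function, the clip range `eps`, and the `beta` value are illustrative assumptions, `ratio` is taken to be exp(logp - old_lp), and stop-gradient (`sg`) has no effect on a forward value, so it is omitted.

```python
import math


def clip(x, low, high):
    return max(low, min(high, x))


def losses(logp, old_logp, adv, weights, eps=0.2, beta=0.05):
    """Forward values of the five losses for per-token lists.

    eps and beta are illustrative defaults, not mlx-tinker's settings.
    """
    ratio = [math.exp(lp - old) for lp, old in zip(logp, old_logp)]
    clipped = [clip(r, 1 - eps, 1 + eps) for r in ratio]
    return {
        "cross_entropy": sum(-lp * w for lp, w in zip(logp, weights)),
        "importance_sampling": -sum(r * a for r, a in zip(ratio, adv)),
        "ppo": -sum(min(r * a, c * a) for r, c, a in zip(ratio, clipped, adv)),
        "cispo": -sum(c * lp * a for c, lp, a in zip(clipped, logp, adv)),
        "dro": -sum(
            lp * a - 0.5 * beta * (lp - old) ** 2
            for lp, old, a in zip(logp, old_logp, adv)
        ),
    }


# On-policy case: current and old log-probs match, so every ratio is 1.
values = losses(
    logp=[-0.5, -1.0],
    old_logp=[-0.5, -1.0],
    adv=[1.0, -1.0],
    weights=[1.0, 1.0],
)
for name, value in values.items():
    print(name, value)
```

In the on-policy case the importance-sampling and PPO values coincide because no ratio is clipped, and the DRO penalty term vanishes because logp equals old_lp.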

Supported Models

Non-MoE Qwen3.5 family:

Model                  Status
Qwen/Qwen3.5-0.8B      Tested
Qwen/Qwen3.5-2B        Tested
Qwen/Qwen3.5-4B        Tested
Qwen/Qwen3.5-9B        Tested
Tesslate/OmniCoder-9B  Tested

OpenAI-Compatible Inference

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "default", "messages": [{"role": "user", "content": "Hello!"}]}'
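
The same request can be built from Python with only the standard library. This sketch mirrors the curl command above: same URL path, same header, same JSON body. The `build_chat_request` helper is a hypothetical name, and the send step is left commented out because it needs a running server.

```python
import json
import urllib.request


def build_chat_request(base_url, content, model="default"):
    """Build a POST request equivalent to the curl example."""
    payload = {"model": model, "messages": [{"role": "user", "content": content}]}
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_chat_request("http://localhost:8080", "Hello!")
print(request.get_method(), request.full_url)

# With the server running, send it like this:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```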

Architecture

tinker.ServiceClient
        |
   FastAPI Server (api/)         19 endpoints, full Tinker wire protocol
        |
   Async Engine (engine/)        100ms polling cycles, barrier-aware batching
        |
   MLX Backend (backend/)        QLoRA, training, inference, checkpointing
        |
   Apple Silicon (Metal GPU)     Unified memory, no CUDA

  • API layer: FastAPI with Tinker-compatible request/response models, session management, and OpenAI-compat endpoints
  • Engine: Async polling loop that batches compatible requests (forward_backward, sample) and respects barriers (optim_step must wait for all pending forward_backward)
  • Backend: MLX compute, including nn.value_and_grad for training, KV-cache inference, chunked cross-entropy, and an 8-bit Adam optimizer
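
The engine's barrier rule can be illustrated with a simplified scheduler. This is a sketch under assumptions, not the engine's code: `next_batch` is a hypothetical function, requests are reduced to their kind, and the real engine works in polling cycles rather than draining a queue in one pass. It shows consecutive requests of the same kind being batched while optim_step runs alone, after every forward_backward queued ahead of it.

```python
from collections import deque


def next_batch(queue):
    """Pop the next batch: consecutive same-kind requests, or one barrier."""
    if not queue:
        return []
    first = queue.popleft()
    batch = [first]
    if first == "optim_step":
        return batch  # barrier: runs alone, after everything queued before it
    while queue and queue[0] == first:
        batch.append(queue.popleft())
    return batch


queue = deque(
    ["forward_backward", "forward_backward", "optim_step", "sample", "sample"]
)
batches = []
while queue:
    batches.append(next_batch(queue))
print(batches)
```

Because batches are taken strictly in queue order, the optimizer step can never overtake a forward_backward that was submitted before it.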

Run the Tests

# Unit tests (fast, no model download needed for most)
uv run pytest tests/ -k "not stress and not cookbook"

# SFT + RL cookbook tests (downloads Qwen3.5-0.8B, ~5 min)
uv run pytest tests/cookbook/ -m cookbook -v

# Full benchmark
uv run python scripts/run_benchmark.py --mlx-only --sft-steps 50

All 10 cookbook tests pass, covering SFT convergence, RL with importance sampling, PPO, tool-use RL, and capability proofs (SFT + RL progression on exact-match tasks).

Requirements

  • macOS with Apple Silicon (M1/M2/M3/M4)
  • Python 3.12+
  • Developed and tested on an M4 MacBook Pro 24GB

Project Structure

mlx_tinker/
  api/        FastAPI server + Tinker endpoints + OpenAI compat
  engine/     Async request scheduler + dispatcher
  backend/    MLX training, inference, QLoRA, loss functions, checkpointing
  db/         SQLModel ORM with async SQLite (WAL mode)
  types.py    Tinker-compatible enums and data types
  config.py   Pydantic configuration
tests/
  cookbook/    SFT + RL end-to-end workflow tests
  stress/     Cross-framework parity tests (MLX vs PyTorch)
scripts/
  bootstrap_openclaw_rl.sh    Optional advanced helper for standalone wrapper scripts
  run_benchmark.py            Optional benchmark runner
  generate_readme_plot.py     Optional README asset generator
  generate_wcb_readme_plot.py Optional WildClawBench plot generator

Proof Of Concept Status

This repo should be read as a working proof of concept, not a fully productized local-agent platform yet.

What is honestly validated today:

  • OpenClaw:
    • managed local setup
    • live continual RL from real sessions
    • WildClawBench local RL runs on a MacBook
    • short combine runs validated against the mlx-tinker backend
  • Hermes Agent:
    • live continual RL from real sessions
    • fresh-user and resumed-session flows validated locally on Qwen/Qwen3.5-2B
    • resumed-session RL header fix validated end to end

What is still rough:

  • Hermes is still a PoC integration, not yet a polished one-command managed product like OpenClaw.
  • Hermes record persistence is not fully cleaned up yet; today the bridge logs are the source of truth for successful training runs.
  • Hermes opd / combine codepaths exist, but they have not been end-to-end validated in this repo yet.

Built-in LoRA UI

The server now ships with a built-in LoRA web UI for browsing exported adapters, checking training statistics, inspecting live in-memory LoRAs, and downloading saved LoRAs as .zip bundles.

Start the server as usual, then open /ui/loras in your browser. If you run the API on a different host or port, use that same base URL with the /ui/loras path.

The UI is backed by the same API server and scans the checkpoints tree directly, so it can surface nested sampler exports, DB-backed training stats, and built-in download actions without any extra frontend build step.

[Figure: Built-in LoRA web UI]
