mlx-tinker

Proof-of-concept local Tinker backend for Apple Silicon that can actually keep learning. Run Qwen3.5 locally on a MacBook, plug it into an agent runtime, and do continual RL updates from real agent trajectories.

mlx-tinker implements the Tinker API on top of Apple's MLX framework. The interesting part is not just local inference: this repo now has working local continual-learning loops for both OpenClaw and Hermes Agent, with reward flowing back into local PPO-style updates on Apple Silicon.

Local Continual RL on a MacBook

In the validated OpenClaw setup below, WildClawBench task containers run OpenClaw, OpenClaw-RL scores the resulting trajectories, and mlx-tinker applies PPO updates locally on a MacBook.

The plotted run is an end-to-end local OpenClaw loop with tasks sourced from WildClawBench.
In that run, the system completed 32 PPO steps and scored 39 trajectories locally.
Reward moves off the floor and positive-reward steps start appearing in the back half of training.
The same local stack also supports live continual learning from real agent sessions.
This work was validated on an M4 MacBook Pro with 24GB unified memory.

The currently validated OpenClaw-RL dependency is the fork branch ojus1/OpenClaw-RL@codex/qwen35-openclaw-tinker. mlx-tinker bootstraps that automatically in the managed OpenClaw path, so you do not need to wait for upstream PR timing to use the local-learning stack.

Multi-turn agent use is practical because mlx-tinker is not recomputing every long prompt from scratch on every turn. It uses disk-backed transcript prefix caching to offload reusable prompt/KV state locally, so repeated system prompts, tool schemas, and conversation prefixes can be restored instead of rebuilt. That is paired with quantized KV cache support for in-memory generation and gradient checkpointing for training-time memory savings, which is what makes longer agent sessions and local continual RL workable on a MacBook.
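
As a rough illustration of the disk-backed prefix caching idea, the sketch below hashes a token prefix, stores an opaque state blob on disk, and looks up the longest cached prefix for a new prompt. The class name PrefixDiskCache, the file layout, and the pickled payload are hypothetical and are not taken from the mlx-tinker codebase; the real cache stores prompt/KV state rather than a string.

```python
import hashlib
import pickle
import tempfile
from pathlib import Path


class PrefixDiskCache:
    """Toy disk-backed cache keyed by a hash of the token prefix (hypothetical)."""

    def __init__(self, root: Path):
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)

    def _path(self, tokens: list[int]) -> Path:
        digest = hashlib.sha256(repr(tokens).encode("utf-8")).hexdigest()
        return self.root / f"{digest}.pkl"

    def put(self, tokens: list[int], state: object) -> None:
        # Persist the reusable state for exactly this token prefix.
        with open(self._path(tokens), "wb") as f:
            pickle.dump(state, f)

    def longest_prefix(self, tokens: list[int]):
        """Return (prefix_length, state) for the longest cached prefix, or (0, None)."""
        for end in range(len(tokens), 0, -1):
            path = self._path(tokens[:end])
            if path.exists():
                with open(path, "rb") as f:
                    return end, pickle.load(f)
        return 0, None


cache = PrefixDiskCache(Path(tempfile.mkdtemp()))
system_prompt = [1, 2, 3, 4]
cache.put(system_prompt, "kv-state-for-system-prompt")

# A new turn that starts with the cached system prompt restores 4 tokens of state.
hit_len, state = cache.longest_prefix(system_prompt + [5, 6])
# An unrelated prompt finds nothing and would be computed from scratch.
miss_len, miss_state = cache.longest_prefix([9, 9])
```

The linear scan over prefix lengths keeps the sketch short; only the tokens after hit_len would need a fresh forward pass.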

Choose Your Integration

OpenClaw

This is the most complete path today. It has the best onboarding story and the most thorough validation.

OpenClaw Requirements And First Install

macOS with Apple Silicon
Python 3.12+
uv
git
node
Docker Desktop

Recommended models:

Qwen/Qwen3.5-4B on 24GB+ Macs
Qwen/Qwen3.5-0.8B on smaller-memory Macs

Install the repo:

git clone https://github.com/ojus1/mlx-tinker.git
cd mlx-tinker
uv sync

Then start the managed local-learning stack:

uv run python -m mlx_tinker openclaw setup --model Qwen/Qwen3.5-4B

The first run downloads the model, clones the required external repos into .external/ (OpenClaw and the supported OpenClaw-RL fork), and builds the local OpenClaw gateway image, so expect it to take a few minutes.

That one command starts three pieces:

native mlx-tinker inference + training backend
native OpenClaw-RL proxy/trainer
Dockerized OpenClaw gateway

It also patches OpenClaw to use the stable local model alias mlx-tinker-local/local-primary, installs the RL header plugin, and stores managed runtime state under ~/.openclaw/mlx-tinker/.

New OpenClaw Users

uv run python -m mlx_tinker openclaw setup --model Qwen/Qwen3.5-4B
uv run python -m mlx_tinker openclaw status

After setup:

OpenClaw is available on the local gateway port shown by status
the default model already points at the local learning backend
local webchat/CLI sessions stay inline instead of trying to route through outbound messaging tools

Existing OpenClaw Users

Run the same command:

uv run python -m mlx_tinker openclaw setup --model Qwen/Qwen3.5-4B

The managed setup preserves your existing OpenClaw installation:

it backs up the current ~/.openclaw/openclaw.json
it keeps your channels, other agent settings, and workspace defaults intact
it only switches the model/backend path over to the managed local-learning stack

OpenClaw Service Commands

uv run python -m mlx_tinker openclaw status
uv run python -m mlx_tinker openclaw logs --service all
uv run python -m mlx_tinker openclaw start
uv run python -m mlx_tinker openclaw stop

Hermes Agent

The Hermes Agent integration works for live RL, but it is not yet a one-command managed onboarding experience like OpenClaw.

Validated today:

live RL with Hermes Agent -> Hermes RL bridge -> mlx-tinker
new-user flow on a fresh HERMES_HOME
existing-user / resumed-session flow on a persisted HERMES_HOME
end-to-end local training on Qwen/Qwen3.5-2B

Not yet claimed:

end-to-end Hermes combine
end-to-end Hermes opd
polished Docker-managed Hermes onboarding

Hermes Requirements And First Install

macOS with Apple Silicon
Python 3.12+
uv
git

Install mlx-tinker:

git clone https://github.com/ojus1/mlx-tinker.git
cd mlx-tinker
uv sync

Clone the Hermes fork with the minimal live-RL header patch:

git clone https://github.com/ojus1/hermes-agent.git .external/hermes-agent
git -C .external/hermes-agent checkout codex/mlx-tinker-live-rl

For the currently validated Hermes PoC settings, start the local stack like this:

MODEL_NAME='Qwen/Qwen3.5-2B' \
HERMES_RL_MAX_CONTEXT_TOKENS=2048 \
HERMES_RL_LORA_RANK=8 \
bash scripts/run_hermes_rl.sh

That helper starts:

native mlx-tinker
the Hermes RL bridge at http://127.0.0.1:30050/v1
a local training loop listening for Hermes traffic

New Hermes Users

Point Hermes at a fresh home directory:

HERMES_HOME=~/hermes-poc-new \
HERMES_RL_ENABLED=1 \
HERMES_RL_PROXY_BASE_URL=http://127.0.0.1:30050/v1 \
uv run --directory .external/hermes-agent python run_agent.py \
  --model hermes-local \
  --base_url http://127.0.0.1:30050/v1 \
  --api_key hermes-local

Existing Hermes Users

Keep your current Hermes home and point Hermes at the bridge:

HERMES_HOME=~/.hermes \
HERMES_RL_ENABLED=1 \
HERMES_RL_PROXY_BASE_URL=http://127.0.0.1:30050/v1 \
uv run --directory .external/hermes-agent python run_agent.py \
  --model hermes-local \
  --base_url http://127.0.0.1:30050/v1 \
  --api_key hermes-local

The live-learning signal is header-based, not transcript scraping. Main Hermes model calls are tagged with X-Session-Id, X-Hermes-Outer-Turn-Id, X-Hermes-Step-Index, X-Turn-Type, and X-Hermes-Request-Id, which lets the bridge reconstruct multi-step tool-use trajectories, dedupe retries, and score turns against the next state.
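
To make the header-based reconstruction concrete, here is a minimal sketch of how tagged requests could be grouped into ordered trajectories. The header names come from the paragraph above; the function name, the example values, and the assumption that a retry reuses the same X-Hermes-Request-Id are illustrative and not taken from the bridge implementation.

```python
from collections import defaultdict


def reconstruct_trajectories(requests: list[dict]) -> dict:
    """Group tagged requests into ordered step indices per (session, outer turn)."""
    seen_request_ids = set()
    grouped = defaultdict(list)
    for headers in requests:
        request_id = headers["X-Hermes-Request-Id"]
        if request_id in seen_request_ids:
            continue  # assumed: a retry repeats the request id, so skip it
        seen_request_ids.add(request_id)
        key = (headers["X-Session-Id"], headers["X-Hermes-Outer-Turn-Id"])
        grouped[key].append(int(headers["X-Hermes-Step-Index"]))
    return {key: sorted(steps) for key, steps in grouped.items()}


def tagged(session, turn, step, request_id):
    # Helper that builds a header dict with hypothetical values.
    return {
        "X-Session-Id": session,
        "X-Hermes-Outer-Turn-Id": turn,
        "X-Hermes-Step-Index": str(step),
        "X-Turn-Type": "main",
        "X-Hermes-Request-Id": request_id,
    }


requests = [
    tagged("s1", "t1", 1, "r2"),  # arrives out of order
    tagged("s1", "t1", 0, "r1"),
    tagged("s1", "t1", 1, "r2"),  # retry of the same request
    tagged("s1", "t2", 0, "r3"),
]
trajectories = reconstruct_trajectories(requests)
```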

How To Tell Learning Is Actually Happening

Serving and training are separate things, so the right thing to check is the proxy/trainer log.

OpenClaw

uv run python -m mlx_tinker openclaw logs --service proxy

Look for lines like:

submitted session=...
drained 1 groups
forward_backward
optim_step

OpenClaw training records are written under:

~/.openclaw/mlx-tinker/records/conversations.jsonl
~/.openclaw/mlx-tinker/records/prm_scores.jsonl
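
A quick way to inspect these files is to load them as JSON Lines. The sketch below assumes one JSON object per line, as the .jsonl extension suggests; the record fields are not documented here, so the demo writes its own placeholder records to a temporary file rather than guessing the real schema.

```python
import json
import tempfile
from pathlib import Path


def load_jsonl(path: Path) -> list:
    """Parse a JSON Lines file into a list, skipping blank lines."""
    records = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


# Demo on a temporary file with placeholder records.
demo_path = Path(tempfile.mkdtemp()) / "conversations.jsonl"
demo_path.write_text('{"example": 1}\n\n{"example": 2}\n', encoding="utf-8")
records = load_jsonl(demo_path)
```

To look at real training records, pass Path.home() / ".openclaw/mlx-tinker/records/conversations.jsonl" instead of the temporary path.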

Hermes Agent

tail -f /tmp/hermes_rl_bridge.log
tail -f /tmp/hermes_mlx_tinker.log

Look for the same progression:

submitted session=...
prm_eval_score=...
step 1: forward_backward
step 1: optim_step
training complete
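
If you would rather check a log programmatically than read it by eye, a small scanner along these lines can confirm that the training markers appear in order. The marker strings are the ones listed above; the function name and the sample log values are hypothetical.

```python
MARKERS = ["submitted session=", "forward_backward", "optim_step"]


def training_progress(log_text: str) -> dict:
    """Report which progress markers appear, and whether they first appear in order."""
    positions = {marker: log_text.find(marker) for marker in MARKERS}
    seen = {marker: pos >= 0 for marker, pos in positions.items()}
    found = [positions[m] for m in MARKERS if positions[m] >= 0]
    in_order = all(seen.values()) and found == sorted(found)
    return {"seen": seen, "learning": in_order}


sample_log = "\n".join([
    "submitted session=abc",
    "prm_eval_score=0.5",
    "step 1: forward_backward",
    "step 1: optim_step",
    "training complete",
])
report = training_progress(sample_log)

# A log with submissions but no optimizer step means serving only, not learning.
serving_only = training_progress("submitted session=abc\n")
```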

One important nuance for both integrations: the current RL path scores a turn against the next state, so a single isolated one-turn chat will not train immediately. Once the same session gets a follow-up user turn, the previous turn can be scored and submitted into PPO.
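
The next-state rule can be pictured as a one-slot buffer per session: a turn is held until the following turn in the same session arrives, and only then becomes scorable. This is a minimal sketch of that behavior; the class and method names are hypothetical and do not mirror the proxy's internals.

```python
class NextStateBuffer:
    """Hold the latest turn per session until a follow-up turn makes it scorable."""

    def __init__(self):
        self._pending = {}  # session_id -> most recent unscored turn

    def add_turn(self, session_id: str, turn: str):
        """Record a turn; return (previous_turn, next_state) once one can be scored."""
        previous = self._pending.get(session_id)
        self._pending[session_id] = turn
        if previous is None:
            return None  # first turn in the session: nothing to score yet
        return previous, turn


buffer = NextStateBuffer()
first = buffer.add_turn("session-a", "turn 1")
second = buffer.add_turn("session-a", "turn 2")
other = buffer.add_turn("session-b", "turn 1")
```

Here first and other are None because each is the only turn in its session, while second pairs "turn 1" with its next state "turn 2".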

The loop is also mostly asynchronous. Inference stays live during batch collection, PRM scoring, forward_backward, and optim_step. The one deliberate pause is the weight swap: after an optimizer step, the proxy briefly pauses new submissions while it installs the updated sampling client, then resumes normal traffic.
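
The weight-swap pause can be sketched with an asyncio.Event used as a gate: submissions pass through while the gate is open and wait while new weights are installed. The class name, the version counter, and the sleep that stands in for installing the sampling client are illustrative assumptions.

```python
import asyncio


class SubmissionGate:
    """Pause new submissions while updated weights are being installed (sketch)."""

    def __init__(self):
        self._open = asyncio.Event()
        self._open.set()
        self.weights_version = 0

    async def submit(self, item, log):
        await self._open.wait()  # blocks only during a weight swap
        log.append((item, self.weights_version))

    async def swap_weights(self):
        self._open.clear()
        await asyncio.sleep(0.01)  # stand-in for installing the new sampling client
        self.weights_version += 1
        self._open.set()


async def demo():
    gate = SubmissionGate()
    log = []
    await gate.submit("a", log)
    swap = asyncio.create_task(gate.swap_weights())
    await asyncio.sleep(0)       # let the swap start and close the gate
    await gate.submit("b", log)  # waits until the swap has finished
    await swap
    return log


log = asyncio.run(demo())
```

The second submission is recorded against the new weights version, which is the property the brief pause is meant to guarantee.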

Important Configs For Real Multi-Turn Use

If you want the local agent loop to feel good on longer sessions, these are the knobs that matter most:

--max-context-tokens on the RL side controls how much context each training datum keeps before truncation.
--prefix-cache-disk-limit-gb on mlx-tinker controls how much disk space is available for transcript prefix caching.
--kv-cache-bits and --kv-cache-group-size control KV-cache quantization for inference.
--quantized-kv-start controls when KV-cache quantization begins.
--checkpoints controls where LoRA checkpoints and prefix-cache artifacts are stored.
--max-batch-size and --cycle-ms are the backend scheduling knobs.
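
To get a feel for why the KV-cache quantization knobs matter, here is a back-of-the-envelope size estimate. The layer, head, and dimension counts are placeholders and not Qwen3.5's real configuration, and the per-group overhead assumes one fp16 scale and one fp16 bias per quantization group, which is how group-wise affine quantization is commonly stored.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bits=16, group_size=None):
    """Estimate KV-cache size in bytes for keys plus values."""
    elements = 2 * layers * kv_heads * head_dim * tokens  # 2 = keys and values
    bits_per_element = bits
    if group_size is not None:
        # Assumed overhead: fp16 scale + fp16 bias (32 bits) shared by each group.
        bits_per_element += 32 / group_size
    return elements * bits_per_element / 8


# Placeholder model shape, 8192-token context.
shape = dict(tokens=8192, layers=32, kv_heads=8, head_dim=128)
fp16_bytes = kv_cache_bytes(**shape)
q4_bytes = kv_cache_bytes(**shape, bits=4, group_size=64)
savings = fp16_bytes / q4_bytes
```

Under these assumptions the fp16 cache is 1 GiB and the 4-bit cache with group size 64 is roughly 3.6 times smaller.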

Current validated defaults:

OpenClaw managed path:
- RL batch size: 1
- RL max context tokens: 8192
- gateway bind: lan
Hermes PoC path:
- model: Qwen/Qwen3.5-2B
- LoRA rank: 8
- RL batch size: 1
- RL max context tokens: 2048

Use mlx-tinker as a Plain Tinker Backend

If you do not want the OpenClaw stack and only want a local Tinker-compatible server, that path still works too:

uv run python -m mlx_tinker --model Qwen/Qwen3.5-0.8B

Then point the normal Tinker SDK at localhost:

python -c "
import tinker
client = tinker.ServiceClient(base_url='http://localhost:8080', api_key='local')
print(client.healthz())
"

Tinker Compatibility is Real

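The only change is the client constructor: point base_url at the local server (the examples here also pass a placeholder api_key). Everything else (training loops, RL pipelines, checkpointing, sampling) is identical: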

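# These snippets use `await`, so run them inside an async function (for example via asyncio.run).
# train_data, prompt_tokens, and tokenizer are placeholders you supply.
import tinker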

# Tinker (official):  client = tinker.ServiceClient()
# mlx-tinker:
client = tinker.ServiceClient(base_url="http://localhost:8080", api_key="local")

# Create a QLoRA training client
training = await client.create_lora_training_client_async(
    base_model="Qwen/Qwen3.5-4B", rank=8
)

# SFT training loop
for batch in train_data:
    await training.forward_backward_async(batch, loss_fn="cross_entropy")
    await training.optim_step_async(tinker.AdamParams(learning_rate=1e-4))

# Get a sampling client from the trained model
sampler = await training.save_weights_and_get_sampling_client_async()
response = await sampler.sample_async(
    prompt=tinker.ModelInput.from_ints(prompt_tokens),
    num_samples=1,
    sampling_params=tinker.SamplingParams(temperature=0.0, max_tokens=128),
)
print(tokenizer.decode(response.sequences[0].tokens))

RL works too — importance sampling, PPO, the full loop:

# RL: sample rollouts, compute rewards, train with advantages
# (prompt, compute_reward, and build_rl_datum are placeholders you supply)
sampler = await training.save_weights_and_get_sampling_client_async()
rollouts = await sampler.sample_async(prompt, num_samples=8,
    sampling_params=tinker.SamplingParams(temperature=0.8, max_tokens=128))

rewards = [compute_reward(seq) for seq in rollouts.sequences]
advantages = [r - sum(rewards) / len(rewards) for r in rewards]

rl_batch = [build_rl_datum(seq, advantage) for seq, advantage in zip(rollouts.sequences, advantages)]
await training.forward_backward_async(rl_batch, loss_fn="importance_sampling")
await training.optim_step_async(tinker.AdamParams(learning_rate=5e-5))
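
The advantage line in the RL snippet is a plain mean baseline: each rollout's reward minus the group average. A small worked example with made-up rewards shows what that produces, including the degenerate case where every rollout gets the same reward.

```python
def mean_baseline_advantages(rewards: list[float]) -> list[float]:
    """Subtract the group mean reward from each rollout's reward."""
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]


# Mixed outcomes: successes get positive advantages, failures negative ones.
mixed = mean_baseline_advantages([1.0, 0.0, 0.0, 1.0])

# Identical rewards: every advantage is zero, so the batch carries no learning signal.
flat = mean_baseline_advantages([1.0, 1.0, 1.0, 1.0])
```

The advantages always sum to zero within a group, which is why a group of uniformly good or uniformly bad rollouts does not move the policy.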

Benchmark: Tinker (official) vs mlx-tinker

50-step SFT on WikiSQL with QLoRA (rank-8, 4-bit quantization, batch_size=2):

Metric	Tinker (official) 4B	mlx-tinker 4B (M4 MBP 24GB)	mlx-tinker 9B (M4 MBP 24GB)
Initial loss	22.34	24.23	17.90
Final loss	0.07	0.43	0.01
Avg step time	3.7s	5.3s	8.7s
Total train time	184s	267s	436s
Post-train eval accuracy	33%	30%	29%

Both backends converge on WikiSQL SFT with comparable accuracy. mlx-tinker trades some per-step speed for running entirely on your Mac: no cloud costs, no network latency, and your data stays local. Tinker (official) also does not support the 9B model, so mlx-tinker lets you train a model size the hosted service does not offer.
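
As a sanity check on the benchmark table, the snippet below recomputes total training time from the average step time and derives the 4B slowdown. It uses only the numbers in the table; the variable names are arbitrary.

```python
STEPS = 50
runs = {
    "Tinker (official) 4B": {"avg_step_s": 3.7, "total_s": 184},
    "mlx-tinker 4B": {"avg_step_s": 5.3, "total_s": 267},
    "mlx-tinker 9B": {"avg_step_s": 8.7, "total_s": 436},
}

# Estimated total = steps * average step time, rounded to 0.1 s.
estimated = {name: round(r["avg_step_s"] * STEPS, 1) for name, r in runs.items()}

# Gap between the estimate and the reported total, in seconds.
gaps = {name: abs(estimated[name] - r["total_s"]) for name, r in runs.items()}

# Per-step slowdown of local 4B training relative to the hosted 4B run.
slowdown_4b = round(
    runs["mlx-tinker 4B"]["avg_step_s"] / runs["Tinker (official) 4B"]["avg_step_s"], 2
)
```

The reported totals agree with the per-step averages to within about two seconds, and the local 4B run is about 1.43 times slower per step.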

Features

QLoRA with Gradient Checkpointing

4-bit quantized base model with LoRA adapters. Gradient checkpointing recomputes activations during the backward pass instead of storing them, trading extra compute for substantially lower memory use. All testing and development was done on an M4 MacBook Pro 24GB.

Five Loss Functions

Loss	Use Case	Formula
`cross_entropy`	SFT	`(-logp * w).sum()`
`importance_sampling`	Off-policy RL	`-(ratio * adv).sum()`
`ppo`	PPO-clip	`-min(ratio * adv, clip(ratio) * adv).sum()`
`cispo`	Conservative IS	`-(sg(clip(ratio)) * logp * adv).sum()`
`dro`	Direct Reward Opt	`-(logp * adv - 0.5 * beta * (logp - old_lp)^2).sum()`

All losses use sum reduction to match Tinker's official formulas.
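
For readers who want to see the five formulas evaluated on numbers, here is a plain-Python sketch of their forward values. It is not the MLX implementation: sg (stop-gradient) only affects gradients, so it is the identity here, and the clip range (0.8 to 1.2) and beta value are illustrative assumptions rather than documented defaults.

```python
import math


def clip(x, lo, hi):
    return max(lo, min(hi, x))


def loss_values(logp, old_lp, adv, w, clip_lo=0.8, clip_hi=1.2, beta=0.1):
    """Forward values of the five losses for per-token lists (sum reduction)."""
    ratio = [math.exp(a - b) for a, b in zip(logp, old_lp)]
    clipped = [clip(r, clip_lo, clip_hi) for r in ratio]
    return {
        "cross_entropy": sum(-lp * wi for lp, wi in zip(logp, w)),
        "importance_sampling": -sum(r * a for r, a in zip(ratio, adv)),
        "ppo": -sum(min(r * a, c * a) for r, c, a in zip(ratio, clipped, adv)),
        # sg(...) is a no-op for the forward value, so the clipped ratio is used directly.
        "cispo": -sum(c * lp * a for c, lp, a in zip(clipped, logp, adv)),
        "dro": -sum(
            lp * a - 0.5 * beta * (lp - old) ** 2
            for lp, old, a in zip(logp, old_lp, adv)
        ),
    }


# On-policy case: logp equals old_lp, so every ratio is exactly 1.
on_policy = loss_values(
    logp=[-1.0, -2.0], old_lp=[-1.0, -2.0], adv=[1.0, -1.0], w=[1.0, 1.0]
)

# Off-policy case: the ratio is e (about 2.72), so PPO clips it to 1.2.
off_policy = loss_values(logp=[0.0], old_lp=[-1.0], adv=[1.0], w=[1.0])
```

In the off-policy case the importance-sampling loss uses the full ratio while PPO caps it at the clip bound, which is the practical difference between the two rows.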

Supported Models

Non-MoE Qwen3.5 family:

Model	Status
Qwen/Qwen3.5-0.8B	Tested
Qwen/Qwen3.5-2B	Tested
Qwen/Qwen3.5-4B	Tested
Qwen/Qwen3.5-9B	Tested
Tesslate/OmniCoder-9B	Tested

OpenAI-Compatible Inference

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "default", "messages": [{"role": "user", "content": "Hello!"}]}'
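
The same request can be built from Python with only the standard library. This sketch mirrors the curl call above; the helper names are hypothetical, and the response is returned as parsed JSON without assuming any particular fields.

```python
import json
import urllib.request


def build_chat_request(base_url: str, content: str) -> urllib.request.Request:
    """Build a POST to the OpenAI-compatible chat completions endpoint."""
    payload = {"model": "default", "messages": [{"role": "user", "content": content}]}
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request) -> dict:
    """Send the request and parse the JSON body (requires a running server)."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())


request = build_chat_request("http://localhost:8080", "Hello!")
# reply = send(request)  # uncomment once the mlx-tinker server is running
```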

Architecture

tinker.ServiceClient
        |
   FastAPI Server (api/)         19 endpoints, full Tinker wire protocol
        |
   Async Engine (engine/)        100ms polling cycles, barrier-aware batching
        |
   MLX Backend (backend/)        QLoRA, training, inference, checkpointing
        |
   Apple Silicon (Metal GPU)     Unified memory, no CUDA

API layer — FastAPI with Tinker-compatible request/response models, session management, and OpenAI-compat endpoints
Engine — Async polling loop that batches compatible requests (forward_backward, sample) and respects barriers (optim_step must wait for all pending forward_backward)
Backend — MLX compute: nn.value_and_grad for training, KV-cache inference, chunked cross-entropy, 8-bit Adam optimizer
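
The barrier rule in the engine description can be sketched as a simple planner over an ordered queue: consecutive requests of the same kind are batched together, and optim_step always runs alone after everything queued before it. This is an illustration of the idea only; the function name is hypothetical and the real scheduler may group requests differently.

```python
def plan_batches(queue: list[str]) -> list[list[str]]:
    """Split an ordered request queue into batches, treating optim_step as a barrier."""
    batches, current = [], []
    for op in queue:
        if op == "optim_step":
            if current:
                batches.append(current)  # flush pending work before the barrier
                current = []
            batches.append([op])         # the barrier runs on its own
        elif current and current[0] != op:
            batches.append(current)      # a different request kind starts a new batch
            current = [op]
        else:
            current.append(op)
    if current:
        batches.append(current)
    return batches


queue = [
    "forward_backward",
    "forward_backward",
    "optim_step",
    "sample",
    "sample",
    "forward_backward",
]
plan = plan_batches(queue)
```

The two forward_backward requests are batched, the optimizer step waits for them, and later requests are never merged across the barrier.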

Run the Tests

# Unit tests (fast, no model download needed for most)
uv run pytest tests/ -k "not stress and not cookbook"

# SFT + RL cookbook tests (downloads Qwen3.5-0.8B, ~5 min)
uv run pytest tests/cookbook/ -m cookbook -v

# Full benchmark
uv run python scripts/run_benchmark.py --mlx-only --sft-steps 50

All 10 cookbook tests pass, covering SFT convergence, RL with importance sampling, PPO, tool-use RL, and capability proofs (SFT + RL progression on exact-match tasks).

Requirements

macOS with Apple Silicon (M1/M2/M3/M4)
Python 3.12+
Developed and tested on an M4 MacBook Pro 24GB

Project Structure

mlx_tinker/
  api/        FastAPI server + Tinker endpoints + OpenAI compat
  engine/     Async request scheduler + dispatcher
  backend/    MLX training, inference, QLoRA, loss functions, checkpointing
  db/         SQLModel ORM with async SQLite (WAL mode)
  types.py    Tinker-compatible enums and data types
  config.py   Pydantic configuration
tests/
  cookbook/    SFT + RL end-to-end workflow tests
  stress/     Cross-framework parity tests (MLX vs PyTorch)
scripts/
  bootstrap_openclaw_rl.sh    Optional advanced helper for standalone wrapper scripts
  run_benchmark.py            Optional benchmark runner
  generate_readme_plot.py     Optional README asset generator
  generate_wcb_readme_plot.py Optional WildClawBench plot generator

Proof Of Concept Status

This repo should be read as a working proof of concept, not a fully productized local-agent platform yet.

What is honestly validated today:

OpenClaw:
- managed local setup
- live continual RL from real sessions
- WildClawBench local RL runs on a MacBook
- short combine runs validated against the mlx-tinker backend
Hermes Agent:
- live continual RL from real sessions
- fresh-user and resumed-session flows validated locally on Qwen/Qwen3.5-2B
- resumed-session RL header fix validated end to end

What is still rough:

Hermes is still a PoC integration, not yet a polished one-command managed product like OpenClaw.
Hermes record persistence is not fully cleaned up yet; today the bridge logs are the source of truth for successful training runs.
Hermes opd / combine codepaths exist, but they have not been end-to-end validated in this repo yet.

Built-in LoRA UI

The server now ships with a built-in LoRA web UI for browsing exported adapters, checking training statistics, inspecting live in-memory LoRAs, and downloading saved LoRAs as .zip bundles.

Start the server as usual, then open /ui/loras in your browser. If you run the API on a different host or port, use that same base URL with the /ui/loras path.

The UI is backed by the same API server and scans the checkpoints tree directly, so it can surface nested sampler exports, DB-backed training stats, and built-in download actions without any extra frontend build step.
