
Commit 4c5dc46

Add LiteLLM proxy config for Venice API in Cursor

Cursor over-allocates max_tokens for models with 1M context windows (e.g. claude-opus-4-6), causing Venice to reject requests. This adds a LiteLLM proxy config that clamps output tokens to safe limits.

2 files changed

Lines changed: 257 additions & 0 deletions

scripts/venice-litellm/README.md

Lines changed: 88 additions & 0 deletions
# Venice API + Cursor IDE via LiteLLM Proxy

Venice models with 1M-token context windows (e.g. `claude-opus-4-6`, `claude-sonnet-4-6`) fail in Cursor because Cursor derives `max_tokens` from the context window and sends values that exceed Venice's output token limits.

Models with ≤256k context (e.g. `claude-opus-45`) work without a proxy.

LiteLLM sits between Cursor and Venice, clamping `max_tokens` to safe values.

## Prerequisites

- `VENICE_API_KEY` exported in your shell (e.g. in `~/.zshrc`)
- Python 3.10+
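A quick pre-flight check for both prerequisites (a sketch; the `:?` parameter expansion aborts with an error if the variable is unset):

```bash
# Fail fast if the key is not exported, then confirm the interpreter version
: "${VENICE_API_KEY:?VENICE_API_KEY is not set - export it in ~/.zshrc}"
python3 --version   # needs 3.10+
```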
## Setup

```bash
pip install 'litellm[proxy]'
pip install python-multipart
```
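To confirm both packages landed in the active environment (nothing here is LiteLLM-specific; `pip show` just reads installed metadata):

```bash
# Print name and version for each installed package
pip show litellm python-multipart | grep -E '^(Name|Version):'
```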
## Start the proxy

```bash
litellm --config ~/git/edge-conventions/scripts/venice-litellm/litellm-config.yaml --port 8765
```

The config reads your Venice API key from the `VENICE_API_KEY` environment variable via `os.environ/VENICE_API_KEY`, so LiteLLM handles all authentication to Venice directly.
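Once the proxy is running, you can smoke-test it before touching Cursor. A minimal sketch, assuming LiteLLM's standard OpenAI-compatible routes and no `master_key` in the config (so the bearer token is a placeholder):

```bash
# List the models the proxy is serving
curl -s http://localhost:8765/v1/models \
  -H "Authorization: Bearer sk-dummy" | python -m json.tool

# Send one request end-to-end through the proxy to Venice
curl -s http://localhost:8765/v1/chat/completions \
  -H "Authorization: Bearer sk-dummy" \
  -H "Content-Type: application/json" \
  -d '{"model": "claude-opus-4-6", "messages": [{"role": "user", "content": "ping"}]}'
```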
## Configure Cursor

1. Open **Settings > Models**
2. Add custom models by name: `claude-opus-4-6`, `openai-gpt-52`, etc.
3. Set **Override OpenAI Base URL** to `http://localhost:8765`
4. **OpenAI API Key** can be left disabled; LiteLLM already has the Venice key from your environment. If Cursor requires a value, enter any dummy string (e.g. `sk-dummy`).
## Models

All Venice text models are included. Only the 1M-context models need `model_info.max_tokens` to prevent Cursor from over-allocating. The rest pass through unmodified.
### Clamped (1M context, broken without proxy)

| Model | max_tokens | Context |
|-------|------------|---------|
| `claude-opus-4-6` | 8192 | 1M |
| `claude-sonnet-4-6` | 8192 | 1M |
| `gemini-3-1-pro-preview` | 8192 | 1M |
### Pass-through (≤256k context, work without proxy)

| Model | Context | Notes |
|-------|---------|-------|
| `claude-opus-45` | 198k | |
| `claude-sonnet-45` | 198k | |
| `openai-gpt-52` | 256k | |
| `openai-gpt-52-codex` | 256k | Optimized for code |
| `openai-gpt-oss-120b` | 128k | Open-weight MoE |
| `grok-41-fast` | 256k | |
| `grok-code-fast-1` | 256k | Optimized for code |
| `gemini-3-pro-preview` | 198k | |
| `gemini-3-flash-preview` | 256k | |
| `deepseek-v3.2` | 160k | |
| `kimi-k2-thinking` | 256k | |
| `kimi-k2-5` | 256k | |
| `minimax-m21` | 198k | Optimized for code |
| `minimax-m25` | 198k | Optimized for code |
| `zai-org-glm-5` | 198k | |
| `zai-org-glm-4.7` | 198k | |
| `qwen3-coder-480b-a35b-instruct` | 256k | Optimized for code |
| `qwen3-235b-a22b-thinking-2507` | 128k | |
| `qwen3-235b-a22b-instruct-2507` | 128k | |
| `qwen3-vl-235b-a22b` | 256k | Vision-language |
| `llama-3.3-70b` | 128k | |
| `hermes-3-llama-3.1-405b` | 128k | |
| `google-gemma-3-27b-it` | 198k | Vision |
## Adding models

Edit `litellm-config.yaml` following the existing pattern. Use Venice model IDs from their [models endpoint](https://docs.venice.ai/api-reference/endpoint/models/list). Only add `model_info.max_tokens` for models with context windows >256k.
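To look up current model IDs, you can hit the endpoint directly. A minimal sketch, assuming Venice's OpenAI-compatible `/models` route under the same `api_base` the config uses:

```bash
# List Venice model IDs to copy into litellm-config.yaml
curl -s https://api.venice.ai/api/v1/models \
  -H "Authorization: Bearer $VENICE_API_KEY" | python -m json.tool
```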
## Why not use Venice directly?

Venice advertises `availableContextTokens: 1000000` for newer Claude/Gemini models. Cursor uses this to budget `max_tokens`, often requesting 200k+ output tokens. Venice rejects these with:

```
max_tokens: 232001 > 128000, which is the maximum allowed number of output tokens for claude-opus-4-6
```

The proxy avoids this by setting `model_info.max_tokens` per model, which LiteLLM uses to constrain requests.
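A sketch that reproduces the rejection by calling Venice directly with the oversized value from the error above (assumes `VENICE_API_KEY` is exported):

```bash
# Hitting Venice directly with a Cursor-sized max_tokens triggers the error;
# the same request through the proxy on :8765 should succeed, per the clamping above.
curl -s https://api.venice.ai/api/v1/chat/completions \
  -H "Authorization: Bearer $VENICE_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "claude-opus-4-6", "max_tokens": 232001, "messages": [{"role": "user", "content": "ping"}]}'
```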
scripts/venice-litellm/litellm-config.yaml

Lines changed: 169 additions & 0 deletions
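# LiteLLM proxy config: Cursor -> LiteLLM -> Venice.
# Every entry uses LiteLLM's OpenAI-compatible passthrough
# (model: openai/<venice-model-id>) against the Venice api_base;
# only the 1M-context models carry a model_info.max_tokens clamp.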
model_list:
  # --- 1M context models (NEED clamping) ---

  - model_name: claude-opus-4-6
    litellm_params:
      model: openai/claude-opus-4-6
      api_base: https://api.venice.ai/api/v1
      api_key: os.environ/VENICE_API_KEY
    model_info:
      max_tokens: 8192

  - model_name: claude-sonnet-4-6
    litellm_params:
      model: openai/claude-sonnet-4-6
      api_base: https://api.venice.ai/api/v1
      api_key: os.environ/VENICE_API_KEY
    model_info:
      max_tokens: 8192

  - model_name: gemini-3-1-pro-preview
    litellm_params:
      model: openai/gemini-3-1-pro-preview
      api_base: https://api.venice.ai/api/v1
      api_key: os.environ/VENICE_API_KEY
    model_info:
      max_tokens: 8192

  # --- ≤256k context models (pass-through, no clamping needed) ---

  - model_name: claude-opus-45
    litellm_params:
      model: openai/claude-opus-45
      api_base: https://api.venice.ai/api/v1
      api_key: os.environ/VENICE_API_KEY

  - model_name: claude-sonnet-45
    litellm_params:
      model: openai/claude-sonnet-45
      api_base: https://api.venice.ai/api/v1
      api_key: os.environ/VENICE_API_KEY

  - model_name: openai-gpt-52
    litellm_params:
      model: openai/openai-gpt-52
      api_base: https://api.venice.ai/api/v1
      api_key: os.environ/VENICE_API_KEY

  - model_name: openai-gpt-52-codex
    litellm_params:
      model: openai/openai-gpt-52-codex
      api_base: https://api.venice.ai/api/v1
      api_key: os.environ/VENICE_API_KEY

  - model_name: openai-gpt-oss-120b
    litellm_params:
      model: openai/openai-gpt-oss-120b
      api_base: https://api.venice.ai/api/v1
      api_key: os.environ/VENICE_API_KEY

  - model_name: grok-41-fast
    litellm_params:
      model: openai/grok-41-fast
      api_base: https://api.venice.ai/api/v1
      api_key: os.environ/VENICE_API_KEY

  - model_name: grok-code-fast-1
    litellm_params:
      model: openai/grok-code-fast-1
      api_base: https://api.venice.ai/api/v1
      api_key: os.environ/VENICE_API_KEY

  - model_name: gemini-3-pro-preview
    litellm_params:
      model: openai/gemini-3-pro-preview
      api_base: https://api.venice.ai/api/v1
      api_key: os.environ/VENICE_API_KEY

  - model_name: gemini-3-flash-preview
    litellm_params:
      model: openai/gemini-3-flash-preview
      api_base: https://api.venice.ai/api/v1
      api_key: os.environ/VENICE_API_KEY

  - model_name: deepseek-v3.2
    litellm_params:
      model: openai/deepseek-v3.2
      api_base: https://api.venice.ai/api/v1
      api_key: os.environ/VENICE_API_KEY

  - model_name: kimi-k2-thinking
    litellm_params:
      model: openai/kimi-k2-thinking
      api_base: https://api.venice.ai/api/v1
      api_key: os.environ/VENICE_API_KEY

  - model_name: kimi-k2-5
    litellm_params:
      model: openai/kimi-k2-5
      api_base: https://api.venice.ai/api/v1
      api_key: os.environ/VENICE_API_KEY

  - model_name: minimax-m21
    litellm_params:
      model: openai/minimax-m21
      api_base: https://api.venice.ai/api/v1
      api_key: os.environ/VENICE_API_KEY

  - model_name: minimax-m25
    litellm_params:
      model: openai/minimax-m25
      api_base: https://api.venice.ai/api/v1
      api_key: os.environ/VENICE_API_KEY

  - model_name: zai-org-glm-5
    litellm_params:
      model: openai/zai-org-glm-5
      api_base: https://api.venice.ai/api/v1
      api_key: os.environ/VENICE_API_KEY

  - model_name: zai-org-glm-4.7
    litellm_params:
      model: openai/zai-org-glm-4.7
      api_base: https://api.venice.ai/api/v1
      api_key: os.environ/VENICE_API_KEY

  - model_name: qwen3-coder-480b-a35b-instruct
    litellm_params:
      model: openai/qwen3-coder-480b-a35b-instruct
      api_base: https://api.venice.ai/api/v1
      api_key: os.environ/VENICE_API_KEY

  - model_name: qwen3-235b-a22b-thinking-2507
    litellm_params:
      model: openai/qwen3-235b-a22b-thinking-2507
      api_base: https://api.venice.ai/api/v1
      api_key: os.environ/VENICE_API_KEY

  - model_name: qwen3-235b-a22b-instruct-2507
    litellm_params:
      model: openai/qwen3-235b-a22b-instruct-2507
      api_base: https://api.venice.ai/api/v1
      api_key: os.environ/VENICE_API_KEY

  - model_name: qwen3-vl-235b-a22b
    litellm_params:
      model: openai/qwen3-vl-235b-a22b
      api_base: https://api.venice.ai/api/v1
      api_key: os.environ/VENICE_API_KEY

  - model_name: llama-3.3-70b
    litellm_params:
      model: openai/llama-3.3-70b
      api_base: https://api.venice.ai/api/v1
      api_key: os.environ/VENICE_API_KEY

  - model_name: hermes-3-llama-3.1-405b
    litellm_params:
      model: openai/hermes-3-llama-3.1-405b
      api_base: https://api.venice.ai/api/v1
      api_key: os.environ/VENICE_API_KEY

  - model_name: google-gemma-3-27b-it
    litellm_params:
      model: openai/google-gemma-3-27b-it
      api_base: https://api.venice.ai/api/v1
      api_key: os.environ/VENICE_API_KEY
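# Per the README, pre-call checks are what let LiteLLM apply the
# model_info limits above before forwarding a request to Venice.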
router_settings:
  enable_pre_call_checks: true
