Feature Request: Add multimodal (image/vision) support #10

@yangsjt

Description


Add support for multimodal content (images/vision) when proxying requests between Claude API format and OpenAI-compatible APIs.

Current Behavior

Currently, the converter in internal/converter/converter.go only handles:

  • Text content (string messages)
  • Tool calls (tool_use, tool_result)
  • Thinking blocks

Image content blocks are not converted, which means multimodal requests from Claude Code cannot be forwarded to backend providers that support vision capabilities.

Desired Behavior

Support converting Claude's image content format to OpenAI's image_url format:

Claude format (input):

{
  "type": "image",
  "source": {
    "type": "base64",
    "media_type": "image/png",
    "data": "<base64_data>"
  }
}

OpenAI format (output):

{
  "type": "image_url",
  "image_url": {
    "url": "data:image/png;base64,<base64_data>"
  }
}
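The two JSON shapes above map one-to-one, so the conversion is essentially building a `data:` URL from the `media_type` and `data` fields. A minimal sketch in Go follows; the struct and function names are hypothetical (not the actual types in internal/converter/converter.go), they just mirror the JSON shown above:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ClaudeImageSource mirrors Claude's "source" object (field names assumed).
type ClaudeImageSource struct {
	Type      string `json:"type"`       // "base64"
	MediaType string `json:"media_type"` // e.g. "image/png"
	Data      string `json:"data"`       // raw base64 payload, no data: prefix
}

// ClaudeImageBlock mirrors a Claude "image" content block.
type ClaudeImageBlock struct {
	Type   string            `json:"type"` // "image"
	Source ClaudeImageSource `json:"source"`
}

// OpenAIImageURL holds the nested "image_url" object.
type OpenAIImageURL struct {
	URL string `json:"url"`
}

// OpenAIImageBlock mirrors an OpenAI "image_url" content entry.
type OpenAIImageBlock struct {
	Type     string         `json:"type"` // "image_url"
	ImageURL OpenAIImageURL `json:"image_url"`
}

// convertImageBlock builds the OpenAI form as a data URL:
// data:<media_type>;base64,<data>.
func convertImageBlock(in ClaudeImageBlock) OpenAIImageBlock {
	return OpenAIImageBlock{
		Type: "image_url",
		ImageURL: OpenAIImageURL{
			URL: fmt.Sprintf("data:%s;base64,%s", in.Source.MediaType, in.Source.Data),
		},
	}
}

func main() {
	in := ClaudeImageBlock{
		Type: "image",
		Source: ClaudeImageSource{
			Type:      "base64",
			MediaType: "image/png",
			Data:      "iVBORw0KGgo=",
		},
	}
	out, _ := json.Marshal(convertImageBlock(in))
	fmt.Println(string(out))
}
```

A real implementation would also need to handle Claude's `url`-type image sources (pass the URL through directly) and skip or reject image blocks when the target backend does not support vision.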

Use Case

Many users want to use Claude Code with vision-capable models like:

  • GPT-4o / GPT-4 Vision (via OpenAI Direct)
  • Gemini Pro Vision (via OpenRouter)
  • LLaVA, Qwen-VL (via Ollama)

Without multimodal support, users cannot leverage these vision capabilities through the proxy.

Reference

A similar project, takltc/claude-code-chutes-proxy, has already implemented this feature:

"Images and multimodal: request-side user/system image blocks are translated to OpenAI image_url content entries"

Additional Context

This would significantly expand the proxy's utility for workflows involving:

  • Screenshot analysis
  • Diagram/architecture review
  • UI/UX feedback
  • Document image processing

Thank you for this great project! 🙏
