4 changes: 2 additions & 2 deletions README.md
@@ -106,7 +106,7 @@ Get a free API key from [Stream](https://getstream.io/). Developers receive **33

| **Plugin Name** | **Description** | **Docs Link** |
|-------------|-------------|-----------|
| AWS Polly | TTS plugin using Amazon's cloud-based service with natural-sounding voices and neural engine support | [AWS Polly](https://visionagents.ai/integrations/aws-polly) |
| AWS | AWS (Bedrock) integration with support for standard LLM (Qwen, Claude with vision), realtime with Nova 2 Sonic, and TTS with AWS Polly | [AWS](https://visionagents.ai/integrations/aws) |
| Cartesia | TTS plugin for realistic voice synthesis in real-time voice applications | [Cartesia](https://visionagents.ai/integrations/cartesia) |
| Decart | Real-time video restyling capabilities using generative AI models | [Decart](https://visionagents.ai/integrations/decart) |
| Deepgram | STT plugin for fast, accurate real-time transcription with speaker diarization | [Deepgram](https://visionagents.ai/integrations/deepgram) |
@@ -225,7 +225,7 @@ While building the integrations, here are the limitations we've noticed (Dec 202
* Longer videos can cause the AI to lose context. For instance, if it's watching a soccer match, it will get confused after 30 seconds
* Most applications require a combination of small specialized models like YOLO/Roboflow/Moondream, API calls to get more context, and larger models like Gemini/OpenAI
* Image size & FPS need to stay relatively low due to performance constraints
* Video doesn’t trigger responses in realtime models. You always need to send audio/text to trigger a response.

## Star History

95 changes: 79 additions & 16 deletions plugins/aws/README.md
@@ -1,43 +1,101 @@
# AWS Plugin for Vision Agents

AWS (Bedrock) LLM integration for the Vision Agents framework with support for both standard and realtime interactions.
AWS (Bedrock) integration for the Vision Agents framework with support for standard LLM, realtime with Nova Sonic, and text-to-speech with automatic session resumption.

## Installation

```bash
pip install vision-agents-plugins-aws
uv add vision-agents[aws]
```

## Usage

### Standard LLM Usage

This example shows how to use Qwen3 on Bedrock for the LLM.
The AWS plugin supports various Bedrock models including Qwen, Claude, and others. Claude models also support vision/image inputs.

```python
from vision_agents.core import Agent, User
from vision_agents.plugins import aws, getstream, cartesia, deepgram, smart_turn

agent = Agent(
    edge=getstream.Edge(),
    agent_user=User(name="Friendly AI"),
    instructions="Be nice to the user",
    llm=aws.LLM(
        model="qwen.qwen3-32b-v1:0",
        region_name="us-east-1"
    ),
    tts=cartesia.TTS(),
    stt=deepgram.STT(),
    turn_detection=smart_turn.TurnDetection(buffer_duration=2.0, confidence_threshold=0.5),
)
```

The full example is available in `example/aws_qwen_example.py`.
For vision-capable models like Claude:

```python
llm = aws.LLM(
    model="anthropic.claude-3-haiku-20240307-v1:0",
    region_name="us-east-1"
)

# Send image with text
response = await llm.converse(
    messages=[{
        "role": "user",
        "content": [
            {"image": {"format": "png", "source": {"bytes": image_bytes}}},
            {"text": "What do you see in this image?"}
        ]
    }]
)
```

### Realtime Audio Usage

Nova Sonic audio realtime speech-to-speech (STS) is also supported.
AWS Nova 2 Sonic provides realtime speech-to-speech capabilities with automatic reconnection logic. The default model is `amazon.nova-2-sonic-v1:0`.

```python
from vision_agents.core import Agent, User
from vision_agents.plugins import aws, getstream

agent = Agent(
    edge=getstream.Edge(),
    agent_user=User(name="Story Teller AI"),
    instructions="Tell a story suitable for a 7 year old about a dragon and a princess",
    llm=aws.Realtime(
        model="amazon.nova-2-sonic-v1:0",
        region_name="us-east-1",
        voice_id="matthew"  # See available voices in AWS Nova documentation
    ),
)
```

The Realtime implementation includes automatic reconnection logic that re-establishes the session after periods of silence or when approaching connection time limits.

See `example/aws_realtime_nova_example.py` for a complete example.
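
As a rough illustration of that behavior (not the plugin's actual internals), the reconnection loop can be pictured as below; `SILENCE_TIMEOUT_S`, `MAX_CONNECTION_S`, `last_audio_at`, `_open_stream`, and `_resume_session` are hypothetical names used only for this sketch:

```python
import asyncio
import time

# Hypothetical thresholds; the real limits come from the Nova Sonic service.
SILENCE_TIMEOUT_S = 60        # reconnect after this much silence
MAX_CONNECTION_S = 8 * 60     # reconnect before hitting the connection time limit


async def keep_session_alive(realtime):
    """Sketch of reconnect-on-silence / time-limit logic (assumed helper names)."""
    stream = await realtime._open_stream()
    connected_at = time.monotonic()
    while True:
        idle = time.monotonic() - realtime.last_audio_at
        age = time.monotonic() - connected_at
        if idle > SILENCE_TIMEOUT_S or age > MAX_CONNECTION_S - 30:
            # Tear down and resume so the conversation context carries over.
            await stream.close()
            stream = await realtime._resume_session()
            connected_at = time.monotonic()
        await asyncio.sleep(1)
```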

### Text-to-Speech (TTS)

AWS Polly TTS is available for converting text to speech:

```python
from vision_agents.plugins import aws

tts = aws.TTS(
region_name="us-east-1",
voice_id="Joanna", # AWS Polly voice ID
engine="neural", # 'standard' or 'neural'
text_type="text", # 'text' or 'ssml'
language_code="en-US"
)

# Use in agent
agent = Agent(
llm=aws.LLM(model="qwen.qwen3-32b-v1:0"),
tts=tts,
# ... other components
)
```

@@ -70,14 +128,15 @@ def get_weather(city: str) -> dict:

### Realtime (aws.Realtime)

The Realtime implementation **fully supports** function calling with AWS Nova 2 Sonic. Register functions using the `@llm.register_function` decorator:

```python
from vision_agents.plugins import aws

llm = aws.Realtime(
    model="amazon.nova-2-sonic-v1:0",
    region_name="us-east-1",
    voice_id="matthew"
)

@llm.register_function(
@@ -97,19 +156,23 @@ def get_weather(city: str) -> dict:

See `example/aws_realtime_function_calling_example.py` for a complete example.
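
As a minimal sketch of what a registration can look like (the `description` argument and the function body are assumptions, not the confirmed decorator signature), compare the following against that example file:

```python
from vision_agents.plugins import aws

llm = aws.Realtime(
    model="amazon.nova-2-sonic-v1:0",
    region_name="us-east-1"
)

# Hypothetical decorator arguments; check the example file for the exact signature.
@llm.register_function(
    description="Get the current weather for a city"
)
def get_weather(city: str) -> dict:
    # A real implementation would call a weather service here.
    return {"city": city, "temperature_c": 21, "conditions": "sunny"}
```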

## Running the examples
## Configuration

Create a `.env` file, or copy `.env.example` to `.env`, and fill in the values.
### Environment Variables

Create a `.env` file with the following variables:

```
STREAM_API_KEY=your_stream_api_key_here
STREAM_API_SECRET=your_stream_api_secret_here

AWS_BEARER_TOKEN_BEDROCK=
AWS_BEDROCK_API_KEY=
AWS_ACCESS_KEY_ID=
AWS_SECRET_ACCESS_KEY=
AWS_REGION=us-east-1

FAL_KEY=
CARTESIA_API_KEY=
DEEPGRAM_API_KEY=
```

Make sure your `.env` file is configured before running the examples.
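
If you load the variables in Python rather than exporting them in your shell, one common pattern is python-dotenv; this is an assumption about how you run the examples, not a requirement of the plugin:

```python
import os

from dotenv import load_dotenv  # pip install python-dotenv

# Read key=value pairs from .env into the process environment.
load_dotenv()

# boto3/Bedrock clients pick up AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY and
# AWS_REGION from the environment automatically.
assert os.environ.get("STREAM_API_KEY"), "STREAM_API_KEY is missing from .env"
```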