Commit 97fbb24

Merge branch 'main' into security-camera-example
2 parents 4453e0d + 04f006f commit 97fbb24

122 files changed: +17415 -8879 lines changed

.env.example

Lines changed: 4 additions & 0 deletions
@@ -25,3 +25,7 @@ CARTESIA_API_KEY=your_cartesia_api_key_here
 
 # Anthropic API credentials
 ANTHROPIC_API_KEY=your_anthropic_api_key_here
+
+# Roboflow API credentials
+ROBOFLOW_API_KEY=your_roboflow_api_key_here
+ROBOFLOW_API_URL=your_roboflow_api_url_here
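
The two new variables are placeholders that each developer fills in locally. How the Roboflow plugin reads them is not shown in this diff, so the following is only a minimal sketch of a start-up check; `load_roboflow_settings` is a hypothetical helper, not part of the library, and rejecting values that start with `your_` is an assumption based on the placeholder text above.

```python
import os


def load_roboflow_settings(env=None) -> dict:
    """Return the Roboflow settings, failing fast on missing or placeholder values.

    Hypothetical helper for illustration only; the plugin's real loading
    logic is not part of this diff.
    """
    env = os.environ if env is None else env
    settings = {}
    for key in ("ROBOFLOW_API_KEY", "ROBOFLOW_API_URL"):
        value = env.get(key, "")
        # Treat an unset variable or an unedited placeholder as "not configured".
        if not value or value.startswith("your_"):
            raise RuntimeError(f"{key} is not configured")
        settings[key] = value
    return settings
```

Calling `load_roboflow_settings()` with no argument reads from the process environment; passing a plain dict makes the check easy to exercise in tests.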

README.md

Lines changed: 42 additions & 13 deletions
@@ -12,9 +12,8 @@
 
 ## Build Real-Time Vision AI Agents
 
-<a href="https://youtu.be/Hpl5EcCpLw8">
-<img src="assets/demo_thumbnail.png" alt="Watch the demo" style="width:100%; max-width:900px;">
-</a>
+https://github.com/user-attachments/assets/d9778ab9-938d-4101-8605-ff879c29b0e4
+
 
 ### Multi-modal AI agents that watch, listen, and understand video.
 
@@ -28,12 +27,16 @@ Vision Agents give you the building blocks to create intelligent, low-latency vi
 - **Native APIs:** Native SDK methods from OpenAI (`create response`), Gemini (`generate`), and Claude (`create message`) — always access the latest LLM capabilities.
 - **SDKs:** SDKs for React, Android, iOS, Flutter, React Native, and Unity, powered by Stream's ultra-low-latency network.
 
+https://github.com/user-attachments/assets/d66587ea-7af4-40c4-9966-5c04fbcf467c
+
 ---
 
 ## See It In Action
 
 ### Sports Coaching
 
+https://github.com/user-attachments/assets/9527ab03-0541-493b-97b1-e17ff1b20e21
+
 This example shows you how to build golf coaching AI with YOLO and OpenAI realtime.
 Combining a fast object detection model (like YOLO) with a full realtime AI is useful for many different video AI use cases.
 For example: Drone fire detection, sports/video game coaching, physical therapy, workout coaching, just dance style games etc.
@@ -54,10 +57,6 @@ This example shows you how to build golf coaching AI with YOLO and OpenAI realti
 Combining a fast object detection model (like YOLO) with a full realtime AI is useful for many different video AI use cases.
 For example: Drone fire detection. Sports/video game coaching. Physical therapy. Workout coaching, Just dance style games etc.
 
-<a href="https://x.com/nash0x7e2/status/1950341779745599769">
-<img src="assets/golf_example_tweet.png" alt="Golf Example" style="width:100%; max-width:800px;">
-</a>
-
 ### Cluely style Invisible Assistant (coming soon)
 
 Apps like Cluely offer realtime coaching via an invisible overlay. This example shows you how you can build your own invisible assistant.
@@ -107,7 +106,7 @@ Get a free API key from [Stream](https://getstream.io/). Developers receive **33
 
 | **Plugin Name** | **Description** | **Docs Link** |
 |-------------|-------------|-----------|
-| AWS Polly | TTS plugin using Amazon's cloud-based service with natural-sounding voices and neural engine support | [AWS Polly](https://visionagents.ai/integrations/aws-polly) |
+| AWS | AWS (Bedrock) integration with support for standard LLM (Qwen, Claude with vision), realtime with Nova 2 Sonic, and TTS with AWS Polly | [AWS](https://visionagents.ai/integrations/aws) |
 | Cartesia | TTS plugin for realistic voice synthesis in real-time voice applications | [Cartesia](https://visionagents.ai/integrations/cartesia) |
 | Decart | Real-time video restyling capabilities using generative AI models | [Decart](https://visionagents.ai/integrations/decart) |
 | Deepgram | STT plugin for fast, accurate real-time transcription with speaker diarization | [Deepgram](https://visionagents.ai/integrations/deepgram) |
@@ -146,6 +145,16 @@ Check out our getting started guide at [VisionAgents.ai](https://visionagents.ai
 **Tutorial:** [Building real-time sports coaching](https://github.com/GetStream/Vision-Agents/tree/main/examples/02_golf_coach_example)
 **Tutorial:** [Building a real-time meeting assistant](https://github.com/GetStream/Vision-Agents#)
 
+## Examples
+
+| 🔮 Demo Applications | |
+|:-----|---------|
+| <br><h3>Cartesia</h3>Using Cartesia's Sonic 3 model to visually look at what's in the frame and tell a story with emotion.<br><br>• Real-time visual understanding<br>• Emotional storytelling<br>• Frame-by-frame analysis<br><br> [>Source Code and tutorial](https://github.com/GetStream/Vision-Agents/tree/main/plugins/cartesia/example) | <img src="assets/demo_gifs/cartesia.gif" width="320" alt="Cartesia Demo"> |
+| <br><h3>Realtime Stable Diffusion</h3>Realtime stable diffusion using Vision Agents and Decart's Mirage 2 model to create interactive scenes and stories.<br><br>• Real-time video restyling<br>• Interactive scene generation<br>• Stable diffusion integration<br><br> [>Source Code and tutorial](https://github.com/GetStream/Vision-Agents/tree/main/plugins/decart/example) | <img src="assets/demo_gifs/mirage.gif" width="320" alt="Mirage Demo"> |
+| <br><h3>Golf Coach</h3>Using Gemini Live together with Vision Agents and Ultralytics YOLO, we're able to track the user's pose and provide realtime actionable feedback on their golf game.<br><br>• Real-time pose tracking<br>• Actionable coaching feedback<br>• YOLO pose detection<br>• Gemini Live integration<br><br> [>Source Code and tutorial](https://github.com/GetStream/Vision-Agents/tree/main/examples/02_golf_coach_example) | <img src="assets/demo_gifs/golf.gif" width="320" alt="Golf Coach Demo"> |
+| <br><h3>GeoGuesser</h3>Together with OpenAI Realtime and Vision Agents, we can take GeoGuesser to the next level by asking it to identify places in our real world surroundings.<br><br>• Real-world location identification<br>• OpenAI Realtime integration<br>• Visual scene understanding<br><br> [>Source Code and tutorial](https://github.com/GetStream/Vision-Agents/tree/main/examples/other_examples/openai_realtime_webrtc)| <img src="assets/demo_gifs/geoguesser.gif" width="320" alt="GeoGuesser Demo"> |
+
+
 ## Development
 
 See [DEVELOPMENT.md](DEVELOPMENT.md)
@@ -192,20 +201,40 @@ Our favorite people & projects to follow for vision AI
 
 ### 0.2 - Simplification - Nov
 
-- Simplify the library & improved code quality
+- Simplified the library & improved code quality
 - Deepgram Nova 3, Elevenlabs Scribe 2, Fish, Moondream, QWen3, Smart turn, Vogent, Inworld, Heygen, AWS and more
 - Improved openAI & Gemini realtime performance
 - Audio & Video utilities
 
-### 0.3 - Demos - Nov/Dec
+### 0.3 - Demos - Early Dec
 
-### 0.4 - Deploys
+- Mirage, Roboflow, Facial recognition. Nicer demos
+- Deepgram Flux & Elevenlabs Scribe improvements
+
+### 0.4 - Deploys - December
 
 - Tips on deploying agents at scale, monitoring them etc.
+- Guides on SIP & RAG
+
+## Vision AI limitations
+
+Video AI is the frontier of AI. The state of the art is changing daily to help models understand live video.
+While building the integrations, here are the limitations we've noticed (Dec 2025)
+
+* Video AI struggles with small text. If you want the AI to read the score in a game it will often get it wrong and hallucinate
+* Longer videos can cause the AI to lose context. For instance if it's watching a soccer match it will get confused after 30 seconds
+* Most applications require a combination of small specialized models like Yolo/Roboflow/Moondream, API calls to get more context and larger models like gemini/openAI
+* Image size & FPS need to stay relatively low due to performance constraints
+* Video doesn’t trigger responses in realtime models. You always need to send audio/text to trigger a response.
+
+
+## We are hiring
+
+We've recently closed a [\$38 million Series B funding round](https://techcrunch.com/2021/03/04/stream-raises-38m-as-its-chat-and-activity-feed-apis-power-communications-for-1b-users/) and we keep actively growing.
+Our APIs are used by more than a billion end-users, and you'll have a chance to make a huge impact on the product within a team of the strongest engineers all over the world.
 
-### Later
+Check out our current openings and apply via [Stream's website](https://getstream.io/team/#jobs).
 
-[ ] Buffered video capture (for "catch the moment" scenarios)
 
 ## Star History
 

agents-core/pyproject.toml

Lines changed: 2 additions & 0 deletions
@@ -47,6 +47,7 @@ inworld = ["vision-agents-plugins-inworld"]
 kokoro = ["vision-agents-plugins-kokoro"]
 moonshine = ["vision-agents-plugins-moonshine"]
 openai = ["vision-agents-plugins-openai"]
+roboflow = ["vision-agents-plugins-roboflow"]
 smart_turn = ["vision-agents-plugins-smart-turn"]
 ultralytics = ["vision-agents-plugins-ultralytics"]
 wizper = ["vision-agents-plugins-wizper"]
@@ -62,6 +63,7 @@ all-plugins = [
     "vision-agents-plugins-inworld",
     "vision-agents-plugins-kokoro",
     "vision-agents-plugins-moonshine",
+    "vision-agents-plugins-roboflow",
     "vision-agents-plugins-openai",
     "vision-agents-plugins-smart-turn",
     "vision-agents-plugins-ultralytics",

agents-core/vision_agents/core/__init__.py

Lines changed: 2 additions & 2 deletions
@@ -3,6 +3,6 @@
 from vision_agents.core.agents import Agent
 
 from vision_agents.core.cli.cli_runner import cli
+from vision_agents.core.agents.agent_launcher import AgentLauncher
 
-
-__all__ = ["Agent", "User", "cli"]
+__all__ = ["Agent", "User", "cli", "AgentLauncher"]

agents-core/vision_agents/core/agents/agent_launcher.py

Lines changed: 3 additions & 5 deletions
@@ -2,7 +2,7 @@
 
 import asyncio
 import logging
-from typing import Optional, TYPE_CHECKING, Callable, Awaitable, Union, cast
+from typing import TYPE_CHECKING, Awaitable, Callable, Optional, Union, cast
 
 if TYPE_CHECKING:
     from .agents import Agent
@@ -100,13 +100,11 @@ async def warmup(self, **kwargs) -> None:
             warmup_tasks.append(agent.turn_detection.warmup())
 
         # Warmup processors
-        if agent.processors and hasattr(agent.processors, "warmup"):
+        if agent.processors:
             logger.debug("Warming up processors")
             for processor in agent.processors:
                 if hasattr(processor, "warmup"):
-                    logger.debug(
-                        "Warming up processor: %s", processor.__class__.__name__
-                    )
+                    logger.debug("Warming up processor: %s", processor.name)
                     warmup_tasks.append(processor.warmup())
 
         # Run all warmups in parallel
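
The guard change in this hunk looks like more than a cleanup. The loop iterates `agent.processors` as a collection; assuming it is a plain list, `hasattr(agent.processors, "warmup")` is always false, so the old condition would skip every processor. The sketch below reproduces both guards with stand-in code: `DummyProcessor`, `collect_warmups_old`, `collect_warmups_new`, and `run_warmups` are hypothetical names for illustration and are not part of the library.

```python
import asyncio


class DummyProcessor:
    """Hypothetical stand-in for a processor that supports warmup."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.warmed_up = False

    async def warmup(self) -> None:
        self.warmed_up = True


def collect_warmups_old(processors: list) -> list:
    # Old guard: a list object has no "warmup" attribute, so the loop never runs.
    tasks = []
    if processors and hasattr(processors, "warmup"):
        for processor in processors:
            if hasattr(processor, "warmup"):
                tasks.append(processor.warmup())
    return tasks


def collect_warmups_new(processors: list) -> list:
    # New guard: only checks that the collection is non-empty.
    tasks = []
    if processors:
        for processor in processors:
            if hasattr(processor, "warmup"):
                tasks.append(processor.warmup())
    return tasks


async def run_warmups(processors: list) -> int:
    # Run every collected warmup concurrently and report how many ran.
    tasks = collect_warmups_new(processors)
    await asyncio.gather(*tasks)
    return len(tasks)
```

With two `DummyProcessor` instances in a list, `collect_warmups_old` returns an empty list, while `asyncio.run(run_warmups(...))` warms up both. Note that the new log line reads `processor.name`, so under this change every processor that defines `warmup` is also expected to expose a `name` attribute.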
