GetStream
diff --git a/.env.example b/.env.example
Lines changed: 4 additions & 0 deletions
diff --git a/README.md b/README.md
Lines changed: 42 additions & 13 deletions
diff --git a/agents-core/pyproject.toml b/agents-core/pyproject.toml
Lines changed: 2 additions & 0 deletions
diff --git a/agents-core/vision_agents/core/__init__.py b/agents-core/vision_agents/core/__init__.py
Lines changed: 2 additions & 2 deletions
diff --git a/agents-core/vision_agents/core/agents/agent_launcher.py b/agents-core/vision_agents/core/agents/agent_launcher.py
Lines changed: 3 additions & 5 deletions
@@ -25,3 +25,7 @@ CARTESIA_API_KEY=your_cartesia_api_key_here
 
 # Anthropic API credentials
 ANTHROPIC_API_KEY=your_anthropic_api_key_here
+
+# Roboflow API credentials
+ROBOFLOW_API_KEY=your_roboflow_api_key_here
+ROBOFLOW_API_URL=your_roboflow_api_url_here
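The diff does not show how the Roboflow plugin consumes these two variables. As a minimal sketch, assuming they are read from the process environment at startup, a loader could fail fast when either one is missing. `RoboflowSettings` and `load_roboflow_settings` are hypothetical names used for illustration only; they are not part of Vision Agents.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class RoboflowSettings:
    """Hypothetical container for the two values added to .env.example."""

    api_key: str
    api_url: str


def load_roboflow_settings(env=None) -> RoboflowSettings:
    """Read ROBOFLOW_API_KEY and ROBOFLOW_API_URL, raising if either is unset or empty."""
    # Default to the real process environment; a plain dict can be passed in tests.
    env = os.environ if env is None else env
    required = ("ROBOFLOW_API_KEY", "ROBOFLOW_API_URL")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return RoboflowSettings(
        api_key=env["ROBOFLOW_API_KEY"],
        api_url=env["ROBOFLOW_API_URL"],
    )
```

Passing a dict instead of relying on `os.environ` keeps the loader easy to test without touching real credentials.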
@@ -12,9 +12,8 @@
 
 ## Build Real-Time Vision AI Agents
 
-<a href="https://youtu.be/Hpl5EcCpLw8">
-  <img src="assets/demo_thumbnail.png" alt="Watch the demo" style="width:100%; max-width:900px;">
-</a>
+https://github.com/user-attachments/assets/d9778ab9-938d-4101-8605-ff879c29b0e4
+
 
 ### Multi-modal AI agents that watch, listen, and understand video.
 
@@ -28,12 +27,16 @@ Vision Agents give you the building blocks to create intelligent, low-latency vi
 - **Native APIs:** Native SDK methods from OpenAI (`create response`), Gemini (`generate`), and Claude (`create message`) — always access the latest LLM capabilities.
 - **SDKs:** SDKs for React, Android, iOS, Flutter, React Native, and Unity, powered by Stream's ultra-low-latency network.
 
+https://github.com/user-attachments/assets/d66587ea-7af4-40c4-9966-5c04fbcf467c
+
 ---
 
 ## See It In Action
 
 ### Sports Coaching
 
+https://github.com/user-attachments/assets/9527ab03-0541-493b-97b1-e17ff1b20e21
+
 This example shows you how to build golf coaching AI with YOLO and OpenAI realtime.
 Combining a fast object detection model (like YOLO) with a full realtime AI is useful for many different video AI use cases.
 For example: Drone fire detection, sports/video game coaching, physical therapy, workout coaching, just dance style games etc.
@@ -54,10 +57,6 @@ This example shows you how to build golf coaching AI with YOLO and OpenAI realti
 Combining a fast object detection model (like YOLO) with a full realtime AI is useful for many different video AI use cases.
 For example: Drone fire detection. Sports/video game coaching. Physical therapy. Workout coaching, Just dance style games etc.
 
-<a href="https://x.com/nash0x7e2/status/1950341779745599769">
-  <img src="assets/golf_example_tweet.png" alt="Golf Example" style="width:100%; max-width:800px;">
-</a>
-
 ### Cluely style Invisible Assistant (coming soon)
 
 Apps like Cluely offer realtime coaching via an invisible overlay. This example shows you how you can build your own invisible assistant.
@@ -107,7 +106,7 @@ Get a free API key from [Stream](https://getstream.io/). Developers receive **33
 
 | **Plugin Name** | **Description** | **Docs Link** |
 |-------------|-------------|-----------|
-| AWS Polly | TTS plugin using Amazon's cloud-based service with natural-sounding voices and neural engine support | [AWS Polly](https://visionagents.ai/integrations/aws-polly) |
+| AWS | AWS (Bedrock) integration with support for standard LLM (Qwen, Claude with vision), realtime with Nova 2 Sonic, and TTS with AWS Polly | [AWS](https://visionagents.ai/integrations/aws) |
 | Cartesia | TTS plugin for realistic voice synthesis in real-time voice applications | [Cartesia](https://visionagents.ai/integrations/cartesia) |
 | Decart | Real-time video restyling capabilities using generative AI models | [Decart](https://visionagents.ai/integrations/decart) |
 | Deepgram | STT plugin for fast, accurate real-time transcription with speaker diarization | [Deepgram](https://visionagents.ai/integrations/deepgram) |
@@ -146,6 +145,16 @@ Check out our getting started guide at [VisionAgents.ai](https://visionagents.ai
 **Tutorial:** [Building real-time sports coaching](https://github.com/GetStream/Vision-Agents/tree/main/examples/02_golf_coach_example)
 **Tutorial:** [Building a real-time meeting assistant](https://github.com/GetStream/Vision-Agents#)
 
+## Examples
+
+| 🔮 Demo Applications | |
+|:-----|---------|
+|  <br><h3>Cartesia</h3>Using Cartesia's Sonic 3 model to visually look at what's in the frame and tell a story with emotion.<br><br>• Real-time visual understanding<br>• Emotional storytelling<br>• Frame-by-frame analysis<br><br> [>Source Code and tutorial](https://github.com/GetStream/Vision-Agents/tree/main/plugins/cartesia/example) | <img src="assets/demo_gifs/cartesia.gif" width="320" alt="Cartesia Demo"> |
+|  <br><h3>Realtime Stable Diffusion</h3>Realtime stable diffusion using Vision Agents and Decart's Mirage 2 model to create interactive scenes and stories.<br><br>• Real-time video restyling<br>• Interactive scene generation<br>• Stable diffusion integration<br><br> [>Source Code and tutorial](https://github.com/GetStream/Vision-Agents/tree/main/plugins/decart/example) | <img src="assets/demo_gifs/mirage.gif" width="320" alt="Mirage Demo"> |
+|  <br><h3>Golf Coach</h3>Using Gemini Live together with Vision Agents and Ultralytics YOLO, we're able to track the user's pose and provide realtime actionable feedback on their golf game.<br><br>• Real-time pose tracking<br>• Actionable coaching feedback<br>• YOLO pose detection<br>• Gemini Live integration<br><br> [>Source Code and tutorial](https://github.com/GetStream/Vision-Agents/tree/main/examples/02_golf_coach_example) | <img src="assets/demo_gifs/golf.gif" width="320" alt="Golf Coach Demo"> |
+|  <br><h3>GeoGuesser</h3>Together with OpenAI Realtime and Vision Agents, we can take GeoGuesser to the next level by asking it to identify places in our real world surroundings.<br><br>• Real-world location identification<br>• OpenAI Realtime integration<br>• Visual scene understanding<br><br> [>Source Code and tutorial](https://github.com/GetStream/Vision-Agents/tree/main/examples/other_examples/openai_realtime_webrtc)| <img src="assets/demo_gifs/geoguesser.gif" width="320" alt="GeoGuesser Demo"> |
+
+
 ## Development
 
 See [DEVELOPMENT.md](DEVELOPMENT.md)
@@ -192,20 +201,40 @@ Our favorite people & projects to follow for vision AI
 
 ### 0.2 - Simplification - Nov
 
-- Simplify the library & improved code quality
+- Simplified the library & improved code quality
 - Deepgram Nova 3, Elevenlabs Scribe 2, Fish, Moondream, QWen3, Smart turn, Vogent, Inworld, Heygen, AWS and more
 - Improved openAI & Gemini realtime performance
 - Audio & Video utilities
 
-### 0.3 - Demos - Nov/Dec
+### 0.3 - Demos - Early Dec
 
-### 0.4 - Deploys
+- Mirage, Roboflow, facial recognition, and nicer demos
+- Deepgram Flux & Elevenlabs Scribe improvements
+
+### 0.4 - Deploys - December
 
 - Tips on deploying agents at scale, monitoring them etc.
+- Guides on SIP & RAG
+
+## Vision AI limitations
+
+Video AI is the frontier of AI. The state of the art is changing daily to help models understand live video.
+While building the integrations, here are the limitations we've noticed (Dec 2025):
+
+* Video AI struggles with small text. If you want the AI to read the score in a game, it will often get it wrong and hallucinate.
+* Longer videos can cause the AI to lose context. For instance, if it's watching a soccer match, it will get confused after 30 seconds.
+* Most applications require a combination of small specialized models like YOLO/Roboflow/Moondream, API calls to get more context, and larger models like Gemini/OpenAI.
+* Image size & FPS need to stay relatively low due to performance constraints.
+* Video doesn’t trigger responses in realtime models. You always need to send audio/text to trigger a response.
+
+
+## We are hiring
+
+We've recently closed a [\$38 million Series B funding round](https://techcrunch.com/2021/03/04/stream-raises-38m-as-its-chat-and-activity-feed-apis-power-communications-for-1b-users/) and we keep growing.
+Our APIs are used by more than a billion end-users, and you'll have a chance to make a huge impact on the product within a team of the strongest engineers all over the world.
 
-### Later
+Check out our current openings and apply via [Stream's website](https://getstream.io/team/#jobs).
 
-[ ] Buffered video capture (for "catch the moment" scenarios)
 
 ## Star History
 
 
@@ -47,6 +47,7 @@ inworld = ["vision-agents-plugins-inworld"]
 kokoro = ["vision-agents-plugins-kokoro"]
 moonshine = ["vision-agents-plugins-moonshine"]
 openai = ["vision-agents-plugins-openai"]
+roboflow = ["vision-agents-plugins-roboflow"]
 smart_turn = ["vision-agents-plugins-smart-turn"]
 ultralytics = ["vision-agents-plugins-ultralytics"]
 wizper = ["vision-agents-plugins-wizper"]
@@ -62,6 +63,7 @@ all-plugins = [
   "vision-agents-plugins-inworld",
   "vision-agents-plugins-kokoro",
   "vision-agents-plugins-moonshine",
+  "vision-agents-plugins-roboflow",
   "vision-agents-plugins-openai",
   "vision-agents-plugins-smart-turn",
   "vision-agents-plugins-ultralytics",
 
@@ -3,6 +3,6 @@
 from vision_agents.core.agents import Agent
 
 from vision_agents.core.cli.cli_runner import cli
+from vision_agents.core.agents.agent_launcher import AgentLauncher
 
-
-__all__ = ["Agent", "User", "cli"]
+__all__ = ["Agent", "User", "cli", "AgentLauncher"]
@@ -2,7 +2,7 @@
 
 import asyncio
 import logging
-from typing import Optional, TYPE_CHECKING, Callable, Awaitable, Union, cast
+from typing import TYPE_CHECKING, Awaitable, Callable, Optional, Union, cast
 
 if TYPE_CHECKING:
     from .agents import Agent
@@ -100,13 +100,11 @@ async def warmup(self, **kwargs) -> None:
                 warmup_tasks.append(agent.turn_detection.warmup())
 
             # Warmup processors
-            if agent.processors and hasattr(agent.processors, "warmup"):
+            if agent.processors:
                 logger.debug("Warming up processors")
                 for processor in agent.processors:
                     if hasattr(processor, "warmup"):
-                        logger.debug(
-                            "Warming up processor: %s", processor.__class__.__name__
-                        )
+                        logger.debug("Warming up processor: %s", processor.name)
                         warmup_tasks.append(processor.warmup())
 
             # Run all warmups in parallel
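The warmup hunk above collects a coroutine from each processor that defines `warmup` and then runs them in parallel. The following is a minimal standalone sketch of that pattern, assuming `asyncio.gather` as the parallel step, since the truncated hunk does not show the actual call. The two processor classes and `warmup_processors` are hypothetical. The new log line reads `processor.name`, so the sketch also assumes every processor exposes a `name` attribute.

```python
import asyncio
import logging

logger = logging.getLogger(__name__)


class DetectorProcessor:
    """Hypothetical processor that needs warmup (for example, loading a model)."""

    name = "detector"

    def __init__(self) -> None:
        self.ready = False

    async def warmup(self) -> None:
        await asyncio.sleep(0)  # stand-in for real loading work
        self.ready = True


class PassthroughProcessor:
    """Hypothetical processor with no warmup method; it is skipped."""

    name = "passthrough"


async def warmup_processors(processors) -> int:
    """Warm up every processor that defines warmup(); return how many were warmed."""
    warmup_tasks = []
    for processor in processors or []:
        if hasattr(processor, "warmup"):
            logger.debug("Warming up processor: %s", processor.name)
            warmup_tasks.append(processor.warmup())
    if warmup_tasks:
        # Run the collected coroutines concurrently.
        await asyncio.gather(*warmup_tasks)
    return len(warmup_tasks)
```

Checking `hasattr` per processor, rather than on the list itself as the removed line did, is what lets processors without a `warmup` method coexist with ones that have it.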