Apps like Cluely offer real-time coaching via an invisible overlay. This example shows how to build your own invisible assistant.
Get a free API key from [Stream](https://getstream.io/).

|**Plugin Name**|**Description**|**Docs Link**|
|-------------|-------------|-----------|
| AWS | AWS (Bedrock) integration with support for standard LLMs (Qwen, Claude with vision), realtime with Nova 2 Sonic, and TTS with AWS Polly |[AWS](https://visionagents.ai/integrations/aws)|
| Cartesia | TTS plugin for realistic voice synthesis in real-time voice applications |[Cartesia](https://visionagents.ai/integrations/cartesia)|
| Decart | Real-time video restyling capabilities using generative AI models |[Decart](https://visionagents.ai/integrations/decart)|
| Deepgram | STT plugin for fast, accurate real-time transcription with speaker diarization |[Deepgram](https://visionagents.ai/integrations/deepgram)|
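The plugins above are interchangeable stages in a speech-to-speech loop. As a rough illustration of how they compose (this is a hypothetical sketch with stand-in callables, not the library's actual API), an STT → LLM → TTS turn looks like this:

```python
from typing import Callable


def make_pipeline(
    stt: Callable[[bytes], str],   # e.g. a Deepgram-style plugin: audio -> transcript
    llm: Callable[[str], str],     # e.g. an AWS Bedrock-style plugin: transcript -> reply
    tts: Callable[[str], bytes],   # e.g. a Cartesia-style plugin: reply -> audio
) -> Callable[[bytes], bytes]:
    """Compose three pluggable stages into one audio-in/audio-out turn."""
    def run_turn(audio_in: bytes) -> bytes:
        transcript = stt(audio_in)
        reply = llm(transcript)
        return tts(reply)
    return run_turn


# Stub stages stand in for real plugins so the sketch runs anywhere.
pipeline = make_pipeline(
    stt=lambda audio: "what's the score?",
    llm=lambda text: f"You asked: {text}",
    tts=lambda text: text.encode("utf-8"),
)
```

Swapping one provider for another only changes the corresponding argument; the rest of the pipeline is untouched.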
Check out our getting started guide at [VisionAgents.ai](https://visionagents.ai).

**Tutorial:** [Building a real-time meeting assistant](https://github.com/GetStream/Vision-Agents#)
## Examples

| 🔮 Demo Applications ||
|:-----|---------|
| <br><h3>Cartesia</h3>Using Cartesia's Sonic 3 model to visually look at what's in the frame and tell a story with emotion.<br><br>• Real-time visual understanding<br>• Emotional storytelling<br>• Frame-by-frame analysis<br><br> [> Source Code and tutorial](https://github.com/GetStream/Vision-Agents/tree/main/plugins/cartesia/example)| <img src="assets/demo_gifs/cartesia.gif" width="320" alt="Cartesia Demo"> |
| <br><h3>Realtime Stable Diffusion</h3>Realtime stable diffusion using Vision Agents and Decart's Mirage 2 model to create interactive scenes and stories.<br><br>• Real-time video restyling<br>• Interactive scene generation<br>• Stable diffusion integration<br><br> [> Source Code and tutorial](https://github.com/GetStream/Vision-Agents/tree/main/plugins/decart/example)| <img src="assets/demo_gifs/mirage.gif" width="320" alt="Mirage Demo"> |
| <br><h3>Golf Coach</h3>Using Gemini Live together with Vision Agents and Ultralytics YOLO, we're able to track the user's pose and provide realtime actionable feedback on their golf game.<br><br>• Real-time pose tracking<br>• Actionable coaching feedback<br>• YOLO pose detection<br>• Gemini Live integration<br><br> [> Source Code and tutorial](https://github.com/GetStream/Vision-Agents/tree/main/examples/02_golf_coach_example)| <img src="assets/demo_gifs/golf.gif" width="320" alt="Golf Coach Demo"> |
| <br><h3>GeoGuesser</h3>Together with OpenAI Realtime and Vision Agents, we can take GeoGuesser to the next level by asking it to identify places in our real-world surroundings.<br><br>• Real-world location identification<br>• OpenAI Realtime integration<br>• Visual scene understanding<br><br> [> Source Code and tutorial](https://github.com/GetStream/Vision-Agents/tree/main/examples/other_examples/openai_realtime_webrtc)| <img src="assets/demo_gifs/geoguesser.gif" width="320" alt="GeoGuesser Demo"> |
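The golf coach demo above is built on pose keypoints from YOLO. A minimal sketch of the kind of geometry such feedback rests on (the keypoint coordinates and the 150° threshold are illustrative assumptions, not the example's actual code):

```python
import math


def joint_angle(a, b, c):
    """Angle in degrees at vertex b formed by keypoints a-b-c, each an (x, y) pair."""
    ang = math.degrees(
        math.atan2(c[1] - b[1], c[0] - b[0]) - math.atan2(a[1] - b[1], a[0] - b[0])
    )
    ang = abs(ang)
    return 360.0 - ang if ang > 180.0 else ang


# Hypothetical keypoints for shoulder, elbow, wrist during a swing.
shoulder, elbow, wrist = (0.0, 0.0), (1.0, 0.0), (1.0, 1.0)
angle = joint_angle(shoulder, elbow, wrist)  # right angle at the elbow
feedback = "keep your lead arm straighter" if angle < 150.0 else "good extension"
```

A real coach agent would feed angles like this, per frame, into the LLM prompt so the model can give actionable tips instead of guessing from raw pixels.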
## Development

See [DEVELOPMENT.md](DEVELOPMENT.md)
Our favorite people & projects to follow for vision AI

### 0.2 - Simplification - Nov

- Simplified the library & improved code quality
- Deepgram Nova 3, ElevenLabs Scribe 2, Fish, Moondream, Qwen3, Smart turn, Vogent, Inworld, HeyGen, AWS and more
- Tips on deploying agents at scale, monitoring them, etc.
- Guides on SIP & RAG
## Vision AI limitations

Video AI is the frontier of AI. The state of the art is changing daily to help models understand live video.
While building the integrations, here are the limitations we've noticed (Dec 2025):

* Video AI struggles with small text. If you want the AI to read the score in a game, it will often get it wrong and hallucinate.
* Longer videos can cause the AI to lose context. For instance, if it's watching a soccer match it will get confused after 30 seconds.
* Most applications require a combination of small specialized models like YOLO/Roboflow/Moondream, API calls to get more context, and larger models like Gemini/OpenAI.
* Image size & FPS need to stay relatively low due to performance constraints.
* Video doesn't trigger responses in realtime models. You always need to send audio/text to trigger a response.
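To keep image size and FPS within those constraints, a capture loop can drop frames and downscale before anything is sent to a model. A minimal sketch of that gating logic (the 2 FPS and 640 px budgets are assumptions for illustration, not library defaults):

```python
class FrameGate:
    """Admit at most max_fps frames and cap the longest image side."""

    def __init__(self, max_fps: float = 2.0, max_side: int = 640):
        self.min_interval = 1.0 / max_fps
        self.max_side = max_side
        self.last_sent = float("-inf")

    def admit(self, timestamp: float) -> bool:
        """Return True if a frame arriving at this timestamp should be sent."""
        if timestamp - self.last_sent >= self.min_interval:
            self.last_sent = timestamp
            return True
        return False

    def target_size(self, width: int, height: int) -> tuple:
        """Scale dimensions down so the longest side fits within max_side."""
        scale = min(1.0, self.max_side / max(width, height))
        return round(width * scale), round(height * scale)


gate = FrameGate(max_fps=2.0, max_side=640)
# Only frames at least 0.5 s apart pass the gate.
sent = [t for t in (0.0, 0.1, 0.5, 0.9, 1.1) if gate.admit(t)]  # [0.0, 0.5, 1.1]
```

Downscaling a 1920×1080 frame under this budget yields 640×360, which also mitigates the small-text problem less than you'd hope — crop a region of interest instead when text matters.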
## We are hiring

We've recently closed a [\$38 million Series B funding round](https://techcrunch.com/2021/03/04/stream-raises-38m-as-its-chat-and-activity-feed-apis-power-communications-for-1b-users/) and we keep actively growing.
Our APIs are used by more than a billion end-users, and you'll have a chance to make a huge impact on the product within a team of the strongest engineers all over the world.

Check out our current openings and apply via [Stream's website](https://getstream.io/team/#jobs).