Voice Agent Starter - Vision Agents

View Simple Agent Example on GitHub

Check out the complete Simple Agent example in our GitHub repository

In this example, we build a conversational voice AI agent using OpenAI for language understanding, ElevenLabs for natural-sounding speech, and Deepgram for speech recognition. The agent joins a video call, greets the user, handles voice conversation, and can observe the camera feed. This is the best starting point for developers new to Vision Agents.

Vision Agents requires a Stream account for real-time transport. Most providers offer free tiers to get started.

What You Will Build

Listen to user speech and convert it to text with Deepgram STT
Process conversations using OpenAI GPT-4o-mini
Respond with natural-sounding speech via ElevenLabs TTS
Detect when the user has finished speaking with Smart Turn detection
Run on Stream’s low-latency edge network

Next Steps

AI Golf Coach

Add video processing with YOLO pose detection

Integrations

Swap in any of 25+ supported AI providers

Phone Support Agent

⌘I

Documentation Index

View Simple Agent Example on GitHub

​What You Will Build

​Next Steps

AI Golf Coach

Integrations

What You Will Build

Next Steps