Skip to main content

Documentation Index

Fetch the complete documentation index at: https://visionagents.ai/llms.txt

Use this file to discover all available pages before exploring further.

View Simple Agent Example on GitHub

Check out the complete Simple Agent example in our GitHub repository
In this example, we build a conversational voice AI agent using OpenAI for language understanding, ElevenLabs for natural-sounding speech, and Deepgram for speech recognition. The agent joins a video call, greets the user, handles voice conversation, and can observe the camera feed. This is the best starting point for developers new to Vision Agents.
Vision Agents requires a Stream account for real-time transport. Most providers offer free tiers to get started.

What You Will Build

  • Listen to user speech and convert it to text with Deepgram STT
  • Process conversations using OpenAI GPT-4o-mini
  • Respond with natural-sounding speech via ElevenLabs TTS
  • Detect when the user has finished speaking with Smart Turn detection
  • Run on Stream’s low-latency edge network

Next Steps

AI Golf Coach

Add video processing with YOLO pose detection

Integrations

Swap in any of 25+ supported AI providers