Platform Overview
Estuary is a real-time AI conversation platform that gives characters persistent memory, voice, spatial awareness, and the ability to take actions in your application. This page describes the architecture that all Estuary SDKs interact with.
Architecture
Estuary uses a Gateway-Worker architecture designed for low-latency streaming:
┌──────────────┐       ┌───────────────┐       ┌────────────────┐
│   Your App   │──────>│    Gateway    │──────>│  Redis Queue   │
│    (SDK)     │<──────│  (FastAPI +   │<──────│                │
│              │       │  Socket.IO)   │       └───────┬────────┘
└──────────────┘       └───────────────┘               │
       │                                               ▼
       │                                       ┌────────────────┐
┌──────┴──────┐                                │    Workers     │
│   LiveKit   │                                │  (Stateless,   │
│  (WebRTC)   │                                │   Scalable)    │
└─────────────┘                                └────────────────┘
                                                       │
                                               ┌───────┴───────┐
                                               │   Providers   │
                                               │   STT, LLM,   │
                                               │   TTS, VLM    │
                                               └───────────────┘
Gateway
The gateway is a FastAPI server with Socket.IO that handles all client connections. It:
- Authenticates SDK connections using API keys
- Manages WebSocket sessions via Socket.IO v4
- Streams audio to/from clients for speech-to-text
- Negotiates LiveKit WebRTC rooms for low-latency voice
- Routes work to stateless workers via a Redis queue
- Publishes responses back to clients via Redis pub/sub
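The last two responsibilities can be sketched as a job envelope pushed onto the queue. This is a minimal illustration, not Estuary's actual wire format: the field names and `responses:{session_id}` channel naming are assumptions, and a real deployment would use Redis `LPUSH`/`BRPOP` and pub/sub rather than an in-memory list.

```python
import json
import uuid

def enqueue_request(queue: list, session_id: str, text: str) -> str:
    """Wrap a client message in a job envelope and push it onto the queue."""
    request_id = str(uuid.uuid4())
    job = {
        "request_id": request_id,
        "session_id": session_id,
        # Workers publish results to this channel; the gateway subscribes
        # and forwards them to the matching Socket.IO session.
        "reply_channel": f"responses:{session_id}",
        "payload": {"type": "text", "text": text},
    }
    queue.insert(0, json.dumps(job))  # stands in for LPUSH; workers BRPOP
    return request_id
```

Because workers are stateless, everything they need to route the response back travels inside the job itself.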
Workers
Workers are stateless processes that pick up requests from the Redis queue. Each request flows through a pipeline:
- STT -- Transcribe speech to text (Deepgram)
- LLM -- Generate a response (OpenAI) with memory context
- TTS -- Convert the response to speech (ElevenLabs)
Workers stream results back as they become available. Text chunks arrive before the full response is complete, and audio chunks begin streaming as soon as the first sentence is synthesized.
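The streaming behavior above can be sketched with generators: text is emitted sentence by sentence, and each sentence's audio is synthesized as soon as that sentence is complete. The stub functions stand in for Deepgram, the LLM, and ElevenLabs; none of the names here are the actual Estuary worker API.

```python
import re
from typing import Iterator

def transcribe(audio: bytes) -> str:
    return "hello there"  # stand-in for a Deepgram STT call

def generate(prompt: str) -> Iterator[str]:
    # Stand-in for a streaming LLM call; yields partial text chunks.
    yield from ["Hi! ", "Nice to ", "meet you. ", "How are you?"]

def sentences(chunks: Iterator[str]) -> Iterator[str]:
    # Buffer chunks and emit complete sentences so TTS can start early.
    buf = ""
    for chunk in chunks:
        buf += chunk
        while (m := re.search(r"[.!?]\s*", buf)):
            yield buf[: m.end()].strip()
            buf = buf[m.end():]
    if buf.strip():
        yield buf.strip()

def synthesize(sentence: str) -> bytes:
    return sentence.encode()  # stand-in for an ElevenLabs TTS call

def run_pipeline(audio: bytes) -> Iterator[tuple[str, object]]:
    text = transcribe(audio)
    for sentence in sentences(generate(text)):
        yield ("text", sentence)               # text chunk streams first
        yield ("audio", synthesize(sentence))  # audio for that sentence follows
```

Note that the first `("audio", ...)` event is produced after only the first sentence, long before the LLM has finished the full response.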
Infrastructure
| Component | Purpose |
|---|---|
| PostgreSQL (pgvector) | Persistent storage: characters, conversations, memory, knowledge graph |
| Redis | Job queue (BRPOP), pub/sub for response routing, session state, rate limiting |
| LiveKit | WebRTC server for low-latency voice and video streaming |
Core Capabilities
Persistent Memory
Every conversation is remembered. The memory system tracks:
- Episodic memory -- What happened in each conversation
- Semantic memory -- Facts and knowledge extracted over time
- Emotional memory -- Relationship dynamics and sentiment
- Core facts -- Key information about each user
- Entities -- People, places, and things mentioned across conversations
See Memory System for details.
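One way the memory types above might be modeled as stored records is sketched below. The field names and the `kind` vocabulary are illustrative assumptions, not Estuary's actual pgvector schema.

```python
from dataclasses import dataclass, field

@dataclass
class MemoryRecord:
    kind: str     # "episodic", "semantic", "emotional", "core_fact", or "entity"
    content: str  # the remembered text, e.g. "User prefers tea over coffee"
    user_id: str
    embedding: list[float] = field(default_factory=list)  # for pgvector similarity search
    importance: float = 0.5  # hypothetical ranking for LLM context selection
```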
Real-Time Voice
Estuary supports two voice transports:
- WebSocket voice -- Audio streamed as base64-encoded PCM over Socket.IO. Works everywhere, higher latency.
- LiveKit voice -- Audio streamed via WebRTC. Low latency with acoustic echo cancellation (AEC). Preferred when available.
Both modes use the same server-side pipeline: Deepgram for STT, the LLM for response generation, and ElevenLabs for TTS.
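For the WebSocket transport, a client frame might be built roughly as follows. The payload keys and the 16 kHz / signed 16-bit little-endian format are assumptions for illustration; only "base64-encoded PCM over Socket.IO" comes from the description above.

```python
import base64
import struct

def encode_audio_frame(samples: list[int], sample_rate: int = 16000) -> dict:
    """Pack 16-bit little-endian PCM samples and base64-encode them."""
    pcm = struct.pack(f"<{len(samples)}h", *samples)
    return {
        "audio": base64.b64encode(pcm).decode("ascii"),  # safe to send as JSON text
        "sample_rate": sample_rate,
        "encoding": "pcm_s16le",  # hypothetical label for this frame format
    }
```

Base64 inflates the payload by roughly a third, which is one reason the WebRTC path has lower latency and overhead.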
Spatial Awareness
With a camera feed, Estuary builds a real-time scene graph of the user's environment:
- Object detection and tracking
- Spatial relationships between objects
- Surface detection (tables, floors, walls)
- Room identification
- User activity recognition
The scene graph is injected into the LLM context so characters can reference what they see.
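A toy version of that injection step is sketched below: objects and spatial relations rendered as prompt text. The node/edge shapes and the wording are illustrative assumptions, not Estuary's actual scene-graph format.

```python
def scene_to_context(objects: list[dict], relations: list[tuple]) -> str:
    """Render detected objects and spatial relations as LLM context text."""
    labels = {o["id"]: o["label"] for o in objects}
    lines = ["You can currently see: " + ", ".join(labels.values()) + "."]
    for subject, relation, obj in relations:
        lines.append(f"The {labels[subject]} is {relation} the {labels[obj]}.")
    return "\n".join(lines)

objects = [
    {"id": "cup1", "label": "cup"},
    {"id": "table1", "label": "table"},  # a detected surface
]
relations = [("cup1", "on", "table1")]  # a spatial-relationship edge
```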
Action System
Characters can trigger actions in your application by embedding XML tags in their responses:
<action name="wave" />
<action name="navigate" target="kitchen" />
SDKs parse these tags and dispatch them to your application code. See Action Protocol for details.
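A minimal version of that parsing step could look like this. The regex handles only the self-closing form shown above; the real SDK parsers and the normative attribute rules are covered in Action Protocol.

```python
import re

ACTION_RE = re.compile(r'<action\s+([^/>]*?)\s*/>')  # self-closing tags only
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')

def parse_actions(text: str) -> list[dict]:
    """Extract action tags from a response as attribute dicts."""
    return [dict(ATTR_RE.findall(m.group(1))) for m in ACTION_RE.finditer(text)]
```

The surrounding prose is left untouched, so the same response can be shown to the user with the tags stripped while the actions are dispatched to application code.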
How SDKs Connect
All Estuary SDKs communicate with the gateway over two channels:
Socket.IO (Primary)
Socket.IO v4 over WebSocket is the primary transport. SDKs connect to the /sdk namespace with authentication credentials and exchange events for text, voice, vision, and control signals.
Client                          Server
  │                               │
  │──── connect /sdk (auth) ─────>│
  │<──── session_info ────────────│
  │                               │
  │──── text {text} ─────────────>│
  │<──── bot_response (stream) ───│
  │<──── bot_voice (stream) ──────│
  │                               │
  │──── start_voice ─────────────>│
  │──── stream_audio ────────────>│
  │<──── stt_response ────────────│
  │                               │
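The ordering in the diagram can be expressed as a small client-side validity check. The transition table below is a simplified reading of the diagram, not the normative protocol (see Conversation Protocol for the full event reference).

```python
# Which events may follow which, per the sequence diagram; None = start.
VALID_AFTER = {
    None: {"connect"},
    "connect": {"session_info"},
    "session_info": {"text", "start_voice"},
    "text": {"bot_response"},
    "bot_response": {"bot_response", "bot_voice"},
    "bot_voice": {"bot_voice", "text", "start_voice"},
    "start_voice": {"stream_audio"},
    "stream_audio": {"stream_audio", "stt_response"},
    "stt_response": {"text", "start_voice", "stream_audio"},
}

def is_valid_sequence(events: list[str]) -> bool:
    """Check that a flat event trace respects the diagram's ordering."""
    prev = None
    for event in events:
        if event not in VALID_AFTER.get(prev, set()):
            return False
        prev = event
    return True
```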
REST API
The REST API provides access to configuration and data that does not require real-time streaming:
- Character management
- Conversation history
- Memory and knowledge graph queries
- Analytics and usage data
REST endpoints use the X-API-Key header for authentication.
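An authenticated request might be constructed as follows. Only the `X-API-Key` header is documented above; the base URL and the `/v1/characters` path are placeholders, not real endpoints.

```python
import urllib.request

def build_request(base_url: str, path: str, api_key: str) -> urllib.request.Request:
    """Build a GET request carrying the X-API-Key authentication header."""
    req = urllib.request.Request(f"{base_url}{path}")
    req.add_header("X-API-Key", api_key)
    return req
```

Sending it is then a plain `urllib.request.urlopen(req)` (or the equivalent in your HTTP client of choice).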
Next Steps
- Authentication -- How API keys and session handshakes work
- Conversation Protocol -- Full Socket.IO event reference
- Memory System -- How memory works across conversations
- Action Protocol -- Triggering application actions from AI responses