
Platform Overview

Estuary is a real-time AI conversation platform that gives characters persistent memory, voice, spatial awareness, and the ability to take actions in your application. This page describes the architecture that all Estuary SDKs interact with.

Architecture

Estuary uses a Gateway-Worker architecture designed for low-latency streaming:

┌──────────────┐       ┌───────────────┐       ┌────────────────┐
│   Your App   │──────>│    Gateway    │──────>│  Redis Queue   │
│    (SDK)     │<──────│  (FastAPI +   │<──────│                │
│              │       │  Socket.IO)   │       └───────┬────────┘
└──────────────┘       └───────────────┘               │
       │                                               ▼
       │                                       ┌────────────────┐
┌──────┴──────┐                                │    Workers     │
│   LiveKit   │                                │  (Stateless,   │
│  (WebRTC)   │                                │   Scalable)    │
└─────────────┘                                └───────┬────────┘
                                                       │
                                               ┌───────┴───────┐
                                               │   Providers   │
                                               │   STT, LLM,   │
                                               │   TTS, VLM    │
                                               └───────────────┘

Gateway

The gateway is a FastAPI server with Socket.IO that handles all client connections. It:

  • Authenticates SDK connections using API keys
  • Manages WebSocket sessions via Socket.IO v4
  • Streams microphone audio from clients for speech-to-text and synthesized audio back to them
  • Negotiates LiveKit WebRTC rooms for low-latency voice
  • Routes work to stateless workers via a Redis queue
  • Relays worker responses, delivered over Redis pub/sub, back to clients

Workers

Workers are stateless processes that pick up requests from the Redis queue. Each request flows through a pipeline:

  1. STT -- Transcribe speech to text (Deepgram)
  2. LLM -- Generate a response (OpenAI) with memory context
  3. TTS -- Convert the response to speech (ElevenLabs)

Workers stream results back as they become available. Text chunks arrive before the full response is complete, and audio chunks begin streaming as soon as the first sentence is synthesized.
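
The streaming behavior described above can be illustrated with a small generator-based sketch. It is not Estuary's worker code: `fake_stt`, `fake_llm`, and `fake_tts` are hypothetical stand-ins for the Deepgram, OpenAI, and ElevenLabs calls, and the payload shapes are assumptions. Only the event names (`stt_response`, `bot_response`, `bot_voice`) come from this page.

```python
import re
from typing import Iterator, Tuple

# Hypothetical stand-ins for the real providers (Deepgram, OpenAI, ElevenLabs).
def fake_stt(audio: bytes) -> str:
    """Pretend transcription: treat the audio bytes as UTF-8 text."""
    return audio.decode("utf-8")

def fake_llm(prompt: str) -> Iterator[str]:
    """Pretend streaming LLM: yield the reply one word-sized chunk at a time."""
    reply = f"You said: {prompt}. Nice to meet you!"
    for chunk in re.findall(r"\S+\s*", reply):
        yield chunk

def fake_tts(sentence: str) -> bytes:
    """Pretend synthesis: return the sentence as bytes instead of audio."""
    return sentence.encode("utf-8")

# A buffer counts as a complete sentence once it ends in ., ! or ?
SENTENCE_END = re.compile(r"[.!?]\s*$")

def run_pipeline(audio: bytes) -> Iterator[Tuple[str, object]]:
    """Yield (event_name, payload) pairs as soon as each piece is ready."""
    transcript = fake_stt(audio)
    yield ("stt_response", transcript)

    buffer = ""
    for chunk in fake_llm(transcript):
        # Text is forwarded immediately, before the full reply exists.
        yield ("bot_response", chunk)
        buffer += chunk
        if SENTENCE_END.search(buffer):
            # Synthesize each sentence as soon as it is complete.
            yield ("bot_voice", fake_tts(buffer.strip()))
            buffer = ""
    if buffer.strip():
        # Flush any trailing text that lacked closing punctuation.
        yield ("bot_voice", fake_tts(buffer.strip()))
```

Iterating over `run_pipeline(b"hello")` yields the first `bot_voice` event while later `bot_response` chunks are still being produced, which is the interleaving the paragraph above describes.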

Infrastructure

Component              Purpose
PostgreSQL (pgvector)  Persistent storage: characters, conversations, memory, knowledge graph
Redis                  Job queue (BRPOP), pub/sub for response routing, session state, rate limiting
LiveKit                WebRTC server for low-latency voice and video streaming
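
To make the Redis role more concrete, here is a minimal sketch of the queue-plus-pub/sub pattern. The key name `estuary:jobs`, the channel naming, and the job fields are all assumptions made for illustration; the page only states that BRPOP and pub/sub are used. The functions take any client exposing `lpush`, `brpop`, and `publish` (redis-py's `Redis` class provides these), and a tiny in-memory stand-in is included so the sketch runs without a server.

```python
import json
from collections import defaultdict, deque
from typing import Optional

QUEUE_KEY = "estuary:jobs"  # hypothetical key name

def response_channel(session_id: str) -> str:
    """Hypothetical per-session pub/sub channel name."""
    return f"estuary:responses:{session_id}"

def enqueue_job(client, session_id: str, text: str) -> None:
    """Gateway side: push a job onto the head of the list."""
    client.lpush(QUEUE_KEY, json.dumps({"session_id": session_id, "text": text}))

def process_next_job(client, timeout: int = 5) -> Optional[dict]:
    """Worker side: pop from the tail (LPUSH + BRPOP gives FIFO order)."""
    item = client.brpop(QUEUE_KEY, timeout=timeout)
    if item is None:
        return None  # timed out with nothing to do
    _key, raw = item
    job = json.loads(raw)
    # Placeholder "work": a real worker would run the STT/LLM/TTS pipeline.
    reply = {"session_id": job["session_id"], "text": job["text"].upper()}
    client.publish(response_channel(job["session_id"]), json.dumps(reply))
    return reply

class InMemoryRedis:
    """Demo stand-in implementing only the three calls used above."""
    def __init__(self) -> None:
        self.lists = defaultdict(deque)
        self.published = []

    def lpush(self, key, value):
        self.lists[key].appendleft(value)
        return len(self.lists[key])

    def brpop(self, key, timeout=0):
        # A real BRPOP blocks for up to `timeout` seconds; this returns at once.
        if not self.lists[key]:
            return None
        return (key, self.lists[key].pop())

    def publish(self, channel, message):
        self.published.append((channel, message))
        return 0
```

Because workers hold no state of their own, any number of them can call `process_next_job` against the same queue; each job is delivered to exactly one caller.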

Core Capabilities

Persistent Memory

Every conversation is remembered. The memory system tracks:

  • Episodic memory -- What happened in each conversation
  • Semantic memory -- Facts and knowledge extracted over time
  • Emotional memory -- Relationship dynamics and sentiment
  • Core facts -- Key information about each user
  • Entities -- People, places, and things mentioned across conversations

See Memory System for details.

Real-Time Voice

Estuary supports two voice transports:

  • WebSocket voice -- Audio streamed as base64-encoded PCM over Socket.IO. Works wherever a Socket.IO connection is available, at the cost of higher latency.
  • LiveKit voice -- Audio streamed via WebRTC. Low latency with acoustic echo cancellation (AEC). Preferred when available.

Both transports use the same server-side pipeline: Deepgram for STT, the LLM for response generation, and ElevenLabs for TTS.
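
As an illustration of the WebSocket transport's payload format, the sketch below encodes PCM samples as base64 text and decodes them again. The sample format (16-bit signed, little-endian) is an assumption; this page only says the audio is base64-encoded PCM, so check the SDK reference for the actual sample rate, width, and channel count.

```python
import base64
import struct
from typing import List

def encode_pcm16(samples: List[int]) -> str:
    """Pack 16-bit signed little-endian samples and base64-encode them."""
    pcm = struct.pack(f"<{len(samples)}h", *samples)
    return base64.b64encode(pcm).decode("ascii")

def decode_pcm16(payload: str) -> List[int]:
    """Reverse of encode_pcm16: base64 text back to a list of samples."""
    pcm = base64.b64decode(payload)
    return list(struct.unpack(f"<{len(pcm) // 2}h", pcm))
```

Base64 represents every 3 bytes as 4 characters, so the encoded audio is roughly a third larger than the raw PCM.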

Spatial Awareness

With a camera feed, Estuary builds a real-time scene graph of the user's environment:

  • Object detection and tracking
  • Spatial relationships between objects
  • Surface detection (tables, floors, walls)
  • Room identification
  • User activity recognition

The scene graph is injected into the LLM context so characters can reference what they see.
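
One way to picture the injection step is to flatten a scene graph into a few lines of text that can be appended to the LLM prompt. The `SceneGraph` structure and the output format below are hypothetical and chosen only for illustration; Estuary's actual representation is not documented on this page.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class SceneGraph:
    """Hypothetical, simplified scene graph."""
    room: str = "unknown"
    objects: List[str] = field(default_factory=list)
    surfaces: List[str] = field(default_factory=list)
    # Each relation is (subject, relation, target), e.g. ("mug", "on", "table").
    relations: List[Tuple[str, str, str]] = field(default_factory=list)
    activity: str = ""

def scene_to_context(scene: SceneGraph) -> str:
    """Render the scene as plain text suitable for an LLM context block."""
    lines = [f"Room: {scene.room}"]
    if scene.surfaces:
        lines.append("Surfaces: " + ", ".join(scene.surfaces))
    if scene.objects:
        lines.append("Objects: " + ", ".join(scene.objects))
    for subject, relation, target in scene.relations:
        lines.append(f"- {subject} {relation} {target}")
    if scene.activity:
        lines.append(f"User activity: {scene.activity}")
    return "\n".join(lines)
```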

Action System

Characters can trigger actions in your application by embedding XML tags in their responses:

<action name="wave" />
<action name="navigate" target="kitchen" />

SDKs parse these tags and dispatch them to your application code. See Action Protocol for details.
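
The SDKs handle this parsing for you, but a rough sketch shows the idea. It assumes self-closing `<action ... />` tags with double-quoted attributes, as in the examples above, and it assumes that tags are stripped from the text shown to the user. The real SDK parsers may behave differently.

```python
import re
from typing import Dict, List, Tuple

# Matches self-closing tags such as <action name="wave" />.
ACTION_TAG = re.compile(r"<action\b([^>]*?)/>")
# Matches attribute pairs such as name="wave".
ATTRIBUTE = re.compile(r'([\w-]+)\s*=\s*"([^"]*)"')

def extract_actions(text: str) -> Tuple[str, List[Dict[str, str]]]:
    """Return (text without action tags, list of attribute dicts)."""
    actions = []
    for match in ACTION_TAG.finditer(text):
        attrs = dict(ATTRIBUTE.findall(match.group(1)))
        if "name" in attrs:  # ignore tags that do not name an action
            actions.append(attrs)
    clean = ACTION_TAG.sub("", text)
    clean = re.sub(r"\s{2,}", " ", clean).strip()
    return clean, actions
```

An application could then look up each `name` in a table of handlers and pass the remaining attributes as arguments.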


How SDKs Connect

All Estuary SDKs communicate with the gateway over two channels:

Socket.IO (Primary)

Socket.IO v4 over WebSocket is the primary transport. SDKs connect to the /sdk namespace with authentication credentials and exchange events for text, voice, vision, and control signals.

Client                          Server
│                               │
│──── connect /sdk (auth) ─────>│
│<──── session_info ────────────│
│                               │
│──── text {text} ─────────────>│
│<──── bot_response (stream) ───│
│<──── bot_voice (stream) ──────│
│                               │
│──── start_voice ─────────────>│
│──── stream_audio ────────────>│
│<──── stt_response ────────────│
│                               │
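
For readers not using an official SDK, the text exchange above might look roughly like this with the python-socketio client. The event names and the `/sdk` namespace come from the diagram; the auth field name (`api_key`) and the `bot_response` payload shape (`{"text": ...}`) are assumptions, so consult the SDK reference for the real ones.

```python
from typing import Dict, List

class ResponseCollector:
    """Accumulates streamed bot_response text chunks into one string."""

    def __init__(self) -> None:
        self.chunks: List[str] = []

    def on_bot_response(self, data: Dict) -> None:
        # Assumed payload shape: {"text": "<partial text>"}
        self.chunks.append(data.get("text", ""))

    @property
    def text(self) -> str:
        return "".join(self.chunks)

def connect_and_send(url: str, api_key: str, message: str,
                     collector: ResponseCollector):
    """Connect to the /sdk namespace and send one text message."""
    # Imported here so ResponseCollector works without the package installed.
    import socketio  # provided by the python-socketio package

    sio = socketio.Client()
    sio.on("bot_response", collector.on_bot_response, namespace="/sdk")

    @sio.on("session_info", namespace="/sdk")
    def on_session_info(data):
        # Send the message once the server confirms the session.
        sio.emit("text", {"text": message}, namespace="/sdk")

    sio.connect(
        url,
        auth={"api_key": api_key},  # assumed auth field name
        namespaces=["/sdk"],
        transports=["websocket"],
    )
    return sio
```

The caller keeps the returned client alive (for example with `sio.wait()`) while chunks arrive, then reads `collector.text`.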

REST API

The REST API provides access to configuration and data that does not require real-time streaming:

  • Character management
  • Conversation history
  • Memory and knowledge graph queries
  • Analytics and usage data

REST endpoints use the X-API-Key header for authentication.
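
A minimal sketch of an authenticated REST call using only the Python standard library follows. The base URL and the `/characters` path in the usage note are placeholders, not documented endpoints; only the `X-API-Key` header comes from this page.

```python
import json
import urllib.request

def build_request(base_url: str, path: str, api_key: str) -> urllib.request.Request:
    """Build a GET request that carries the API key header."""
    url = base_url.rstrip("/") + "/" + path.lstrip("/")
    return urllib.request.Request(
        url,
        headers={"X-API-Key": api_key, "Accept": "application/json"},
        method="GET",
    )

def fetch_json(request: urllib.request.Request, timeout: float = 10.0):
    """Send the request and decode a JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

For example, `fetch_json(build_request("https://api.example.com", "/characters", key))` would issue the request against a hypothetical host and path.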


Next Steps