Voice (WebSocket)

Add real-time voice conversations using the WebSocket voice transport. The SDK streams audio to the server for speech-to-text, and receives the character's voice response.

How It Works

┌────────────┐     PCM bytes      ┌──────────────┐    stream_audio    ┌────────────────┐
│ Microphone │───────────────────►│ VoiceManager │───────────────────►│ Estuary Server │
└────────────┘                    └──────────────┘                    └───────┬────────┘
                                                                              │
┌────────────┐   decode + play    ┌──────────────┐      bot_voice             │
│  Speaker   │◄───────────────────│ AudioPlayer  │◄───────────────────────────┘
└────────────┘                    └──────────────┘

  1. Audio is captured and streamed as PCM16 bytes via stream_audio WebSocket events.
  2. The server runs speech-to-text (Deepgram) and emits stt_response events with transcription results.
  3. Once a final transcription is produced, the server processes it through the AI pipeline.
  4. The response arrives as bot_response (text) and bot_voice (audio) events.

Starting Voice

await client.connect()

# Start voice -- continuous mode with VAD-based turn detection
await client.start_voice()

By default, the SDK uses voice_transport="websocket". To be explicit:

config = EstuaryConfig(
    server_url="https://api.estuary-ai.com",
    api_key="est_your_api_key",
    character_id="your-character-uuid",
    player_id="user-123",
    voice_transport="websocket",
)

Voice Modes

The SDK supports two voice modes:

from estuary_sdk import VoiceMode

# Continuous mode (default) -- VAD detects when you start/stop speaking
await client.start_voice(VoiceMode.CONTINUOUS)

# Push-to-talk -- you control when audio is sent
await client.start_voice(VoiceMode.PUSH_TO_TALK)

Push-to-Talk

In push-to-talk mode, use start_recording() and stop_recording() to control when audio is streamed:

await client.start_voice(VoiceMode.PUSH_TO_TALK)

# User presses button
await client.start_recording()

# User releases button
await client.stop_recording() # Triggers end-of-turn
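
If a button handler raises or is cancelled between the two calls, the turn may never be closed. One way to guard against that is to pair the calls in a context manager. The sketch below is an illustration only: hold_to_talk is a hypothetical helper, not part of the SDK, and it assumes nothing beyond the start_recording() and stop_recording() methods shown above.

```python
import asyncio
from contextlib import asynccontextmanager


@asynccontextmanager
async def hold_to_talk(client):
    """Stream microphone audio for as long as the `async with` body runs.

    Hypothetical helper: it only wraps client.start_recording() and
    client.stop_recording() so that the turn always ends.
    """
    await client.start_recording()
    try:
        yield
    finally:
        # Runs even if the body raises or the task is cancelled.
        await client.stop_recording()
```

Usage would look like `async with hold_to_talk(client): await button_released.wait()`, where button_released is an asyncio.Event owned by your UI code.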

Sending Raw Audio

You can send raw PCM16 audio bytes directly, regardless of voice mode:

# Send raw PCM16 audio (16-bit signed integer, mono, at configured sample rate)
await client.send_audio(audio_bytes)
Note: send_audio() raises EstuaryError with code VOICE_NOT_ACTIVE if start_voice() has not been called.
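
If your audio source produces floating-point samples, they need to be converted to PCM16 before sending. The following is a minimal sketch using only the standard library. The helper names are hypothetical, little-endian byte order and the 20 ms frame size are assumptions (this page does not specify either), and 16000 Hz is used because it is the default sample_rate listed for VoiceSession.

```python
import math
import struct


def floats_to_pcm16(samples):
    """Convert floats in [-1.0, 1.0] to 16-bit signed PCM bytes.

    Little-endian byte order is an assumption, not taken from the SDK docs.
    """
    clipped = (max(-1.0, min(1.0, s)) for s in samples)
    ints = [int(round(s * 32767)) for s in clipped]
    return struct.pack(f"<{len(ints)}h", *ints)


def split_frames(pcm, sample_rate=16000, frame_ms=20):
    """Yield fixed-duration slices of mono PCM16 audio.

    The last slice may be shorter than the others.
    """
    frame_bytes = sample_rate * frame_ms // 1000 * 2  # 2 bytes per sample
    for start in range(0, len(pcm), frame_bytes):
        yield pcm[start:start + frame_bytes]


# Half a second of a 440 Hz tone sampled at 16 kHz.
tone = [0.5 * math.sin(2 * math.pi * 440 * n / 16000) for n in range(8000)]
pcm = floats_to_pcm16(tone)
frames = list(split_frames(pcm))
```

Each frame could then be passed to `await client.send_audio(frame)` inside an active voice session.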

Speech-to-Text Events

As the user speaks, the server streams back transcription results:

async def on_stt(response):
    if response.is_final:
        print("User said:", response.text)
    else:
        # Interim transcription -- useful for showing live captions
        print("Hearing:", response.text)

client.on("stt_response", on_stt)

After a final stt_response, the server automatically triggers the AI pipeline. The response flows through the same bot_response and bot_voice events used by text chat.
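
For live captions it can help to keep finalized utterances separate from the current interim text. The sketch below shows one possible approach. CaptionBuffer is a hypothetical helper, and it assumes each interim result replaces the previous one rather than extending it, which this page does not confirm.

```python
from dataclasses import dataclass, field


@dataclass
class CaptionBuffer:
    """Keeps finalized utterances plus the latest interim hypothesis."""

    finals: list = field(default_factory=list)
    interim: str = ""

    def update(self, text, is_final):
        if is_final:
            self.finals.append(text)
            self.interim = ""  # the final result supersedes the interim text
        else:
            self.interim = text  # assumed: interim results replace each other

    def render(self):
        """Return finalized text followed by the pending interim text, if any."""
        parts = self.finals + ([self.interim + "..."] if self.interim else [])
        return " ".join(parts)
```

Inside an stt_response handler this would be driven with `captions.update(response.text, response.is_final)`.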

Bot Voice Audio

Voice audio arrives as base64-encoded PCM chunks via the bot_voice event:

async def on_voice(voice):
    print(f"Audio chunk {voice.chunk_index} for message {voice.message_id}")
    if voice.is_final:
        print("Last audio chunk received")

client.on("bot_voice", on_voice)

If you installed the audio extra, you can use the built-in AudioPlayer for playback. Otherwise, audio decoding and playback are your responsibility.
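
If you handle decoding yourself, one simple option is to collect the chunks for a message and wrap them in a WAV container. The sketch below uses only the standard library. chunks_to_wav is a hypothetical helper, and it assumes the decoded payload is 16-bit mono PCM. The sample rate of bot audio is not stated on this page, so it is a required argument.

```python
import base64
import io
import wave


def chunks_to_wav(b64_chunks, sample_rate):
    """Decode base64 PCM chunks and return the bytes of a WAV file.

    Assumes 16-bit mono PCM; the caller supplies the sample rate.
    """
    pcm = b"".join(base64.b64decode(chunk) for chunk in b64_chunks)
    out = io.BytesIO()
    with wave.open(out, "wb") as wav:
        wav.setnchannels(1)  # mono (assumed)
        wav.setsampwidth(2)  # 2 bytes per sample, i.e. 16-bit (assumed)
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    return out.getvalue()
```

The attribute that carries the base64 payload on the bot_voice event object is not named on this page, so check the event object before wiring this into a handler.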

Playback Tracking

Notify the server when audio playback completes to help track conversation pacing:

client.notify_audio_playback_complete(message_id)
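
When you play audio yourself and your player has no completion callback, the playback time can be estimated from the amount of PCM received. This is a rough sketch only: PlaybackClock is a hypothetical helper, it assumes 16-bit mono PCM, and the sample rate must come from your own configuration.

```python
from collections import defaultdict


class PlaybackClock:
    """Tracks how many PCM16 bytes arrived for each message."""

    def __init__(self, sample_rate):
        self.sample_rate = sample_rate
        self._bytes = defaultdict(int)

    def add_chunk(self, message_id, pcm):
        """Record one decoded (raw PCM, not base64) chunk for a message."""
        self._bytes[message_id] += len(pcm)

    def duration(self, message_id):
        """Estimated playback time in seconds (2 bytes per mono sample)."""
        return self._bytes[message_id] / (self.sample_rate * 2)
```

After the final chunk for a message has been queued, you could wait for `clock.duration(message_id)` seconds and then call `client.notify_audio_playback_complete(message_id)`. A real completion callback from the audio player, where available, will be more accurate than this estimate.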

Stopping Voice

await client.stop_voice()

This stops audio streaming, cleans up the voice manager, and emits a voice_stopped event. The client remains connected for text chat.

Muting

Toggle the microphone without stopping the voice session:

client.toggle_mute()
print("Muted:", client.is_muted)

When muted, audio streaming is paused but the voice session remains active. The character's voice responses continue.
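
Because toggle_mute() flips the current state, UI code that needs a specific state (for example a "mute" checkbox) has to check is_muted first. A small sketch, assuming is_muted reflects the new state as soon as toggle_mute() returns, as the snippet above suggests. set_muted is a hypothetical helper, not an SDK function.

```python
def set_muted(client, muted):
    """Put the microphone in the requested state using only toggle_mute().

    Returns the resulting value of client.is_muted.
    """
    if client.is_muted != muted:
        client.toggle_mute()
    return client.is_muted
```

Calling `set_muted(client, True)` twice in a row leaves the microphone muted instead of flipping it back.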

Voice Lifecycle Events

async def on_voice_started():
    print("Voice session started")

async def on_voice_stopped():
    print("Voice session stopped")

client.on("voice_started", on_voice_started)
client.on("voice_stopped", on_voice_stopped)
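
These events can also be used to block until the session is actually running. The sketch below bridges voice_started to an asyncio.Event. start_voice_and_wait and its timeout parameter are hypothetical, and the handler takes no arguments, matching the handlers shown above.

```python
import asyncio


async def start_voice_and_wait(client, timeout=5.0):
    """Start voice and wait until the voice_started event has fired.

    Raises asyncio.TimeoutError if the event does not arrive in time.
    """
    started = asyncio.Event()

    async def on_started():
        started.set()

    # Register the handler before starting so the event cannot be missed.
    client.on("voice_started", on_started)
    await client.start_voice()
    await asyncio.wait_for(started.wait(), timeout)
```

Registering the handler before calling start_voice() matters: if the event fires during the call, the asyncio.Event is already set and the wait returns immediately.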

Example: Voice Chat

import asyncio
from estuary_sdk import EstuaryClient, EstuaryConfig

config = EstuaryConfig(
    server_url="https://api.estuary-ai.com",
    api_key="est_your_api_key",
    character_id="your-character-uuid",
    player_id="user-123",
    voice_transport="websocket",
)

async def main():
    async with EstuaryClient(config) as client:
        async def on_stt(response):
            if response.is_final:
                print("You:", response.text)

        async def on_response(response):
            if response.is_final:
                print("Bot:", response.text)

        async def on_error(err):
            print("Error:", err)

        client.on("stt_response", on_stt)
        client.on("bot_response", on_response)
        client.on("error", on_error)

        await client.connect()
        print("Connected! Starting voice...")

        await client.start_voice()
        print("Listening -- speak into your microphone.")

        # Keep running
        await asyncio.Event().wait()

asyncio.run(main())

VoiceSession

VoiceSession is a high-level wrapper that manages the AudioPlayer, AudioRecorder, and all event wiring as an async context manager. It handles:

  • Creating and disposing the audio player and recorder
  • Wiring bot_voice and audio_received events to the player
  • Sending audio_playback_complete notifications automatically
  • Starting and stopping voice on the client

Continuous Mode

import asyncio
from estuary_sdk import EstuaryClient, EstuaryConfig, VoiceSession, VoiceMode

# `config` is an EstuaryConfig built as in the earlier examples.
async def main():
    async with EstuaryClient(config) as client:
        await client.connect()

        async with VoiceSession(client, mode=VoiceMode.CONTINUOUS) as session:
            # Microphone is streaming, speaker plays responses automatically.
            await asyncio.Event().wait()

Push-to-Talk Mode

async with VoiceSession(client, mode=VoiceMode.PUSH_TO_TALK) as session:
    # User presses button
    await session.start_recording()

    await asyncio.sleep(3)

    # User releases button -- triggers end-of-turn
    await session.stop_recording()

Constructor

VoiceSession(
    client: EstuaryClient,
    mode: VoiceMode = VoiceMode.CONTINUOUS,
    *,
    sample_rate: int = 16000,
    on_state: Callable | None = None,
)

Parameter    Type             Default     Description
-----------  ---------------  ----------  ------------------------------
client       EstuaryClient    required    A connected client instance
mode         VoiceMode        CONTINUOUS  Voice mode
sample_rate  int              16000       Audio sample rate in Hz
on_state     Callable | None  None        Optional state change callback

Methods

Method             Description
-----------------  ------------------------------------------------------------
start()            Start the voice session (called automatically by async with)
stop()             Stop the voice session (called automatically on context exit)
start_recording()  Begin recording (PTT mode)
stop_recording()   Stop recording and trigger end-of-turn (PTT mode)

Properties

Property    Type           Description
----------  -------------  -----------------------------
client      EstuaryClient  The underlying client
is_started  bool           Whether the session is active

Note: VoiceSession requires the audio extra: pip install estuary-sdk[audio]

Next Steps