Voice (WebSocket)
Add real-time voice conversations using the WebSocket voice transport. The SDK streams audio to the server for speech-to-text, and receives the character's voice response.
How It Works
┌────────────┐     PCM bytes     ┌──────────────┐    stream_audio    ┌────────────────┐
│ Microphone │──────────────────►│ VoiceManager │───────────────────►│ Estuary Server │
└────────────┘                   └──────────────┘                    └───────┬────────┘
                                                                             │
┌────────────┐   decode + play   ┌──────────────┐     bot_voice              │
│  Speaker   │◄──────────────────│ AudioPlayer  │◄───────────────────────────┘
└────────────┘                   └──────────────┘
- Audio is captured and streamed as PCM16 bytes via stream_audio WebSocket events.
- The server runs speech-to-text (Deepgram) and emits stt_response events with transcription results.
- Once a final transcription is produced, the server processes it through the AI pipeline.
- The response arrives as bot_response (text) and bot_voice (audio) events.
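To make the first step concrete, here is a minimal sketch of how a captured PCM16 buffer could be cut into fixed-duration frames before streaming. The helper name split_pcm16_frames and the 20 ms frame length are assumptions for illustration only; the frame size the SDK actually uses is not stated here. The 16000 Hz rate matches the default sample_rate shown for VoiceSession.

```python
from typing import Iterator

BYTES_PER_SAMPLE = 2  # PCM16: one 16-bit signed integer per mono sample


def split_pcm16_frames(pcm: bytes, sample_rate: int = 16000, frame_ms: int = 20) -> Iterator[bytes]:
    """Yield consecutive fixed-duration frames from a mono PCM16 buffer.

    frame_ms is a hypothetical choice, not the SDK's documented frame size.
    The last frame may be shorter than the others.
    """
    samples_per_frame = sample_rate * frame_ms // 1000
    frame_bytes = samples_per_frame * BYTES_PER_SAMPLE
    if frame_bytes <= 0:
        raise ValueError("sample_rate and frame_ms must give at least one sample per frame")
    for start in range(0, len(pcm), frame_bytes):
        yield pcm[start:start + frame_bytes]


one_second = bytes(16000 * BYTES_PER_SAMPLE)  # one second of silence at 16 kHz
frames = list(split_pcm16_frames(one_second))
print(len(frames), len(frames[0]))
```

Each frame is a plain bytes object, so it could be passed to whatever call streams audio in your setup.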
Starting Voice
await client.connect()
# Start voice -- continuous mode with VAD-based turn detection
await client.start_voice()
By default, the SDK uses voice_transport="websocket". To be explicit:
config = EstuaryConfig(
    server_url="https://api.estuary-ai.com",
    api_key="est_your_api_key",
    character_id="your-character-uuid",
    player_id="user-123",
    voice_transport="websocket",
)
Voice Modes
The SDK supports two voice modes:
from estuary_sdk import VoiceMode
# Continuous mode (default) -- VAD detects when you start/stop speaking
await client.start_voice(VoiceMode.CONTINUOUS)
# Push-to-talk -- you control when audio is sent
await client.start_voice(VoiceMode.PUSH_TO_TALK)
Push-to-Talk
In push-to-talk mode, use start_recording() and stop_recording() to control when audio is streamed:
await client.start_voice(VoiceMode.PUSH_TO_TALK)
# User presses button
await client.start_recording()
# User releases button
await client.stop_recording() # Triggers end-of-turn
Sending Raw Audio
You can send raw PCM16 audio bytes directly, regardless of voice mode:
# Send raw PCM16 audio (16-bit signed integer, mono, at configured sample rate)
await client.send_audio(audio_bytes)
send_audio() raises EstuaryError with code VOICE_NOT_ACTIVE if start_voice() has not been called.
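If your audio source produces floating-point samples, they need converting to PCM16 before being passed to send_audio(). The sketch below uses only the standard library; the helper names floats_to_pcm16 and sine_tone are made up for this example, and little-endian byte order is an assumption, since the byte order is not specified above.

```python
import math
import struct


def floats_to_pcm16(samples):
    """Convert floats in [-1.0, 1.0] to 16-bit signed PCM bytes.

    Little-endian ("<h") is assumed here; out-of-range values are clipped.
    """
    clipped = [max(-1.0, min(1.0, s)) for s in samples]
    ints = [int(round(s * 32767)) for s in clipped]
    return struct.pack(f"<{len(ints)}h", *ints)


def sine_tone(freq_hz=440.0, duration_s=0.5, sample_rate=16000):
    """Generate a half-amplitude mono sine wave as a list of floats."""
    n = int(duration_s * sample_rate)
    return [0.5 * math.sin(2 * math.pi * freq_hz * i / sample_rate) for i in range(n)]


audio_bytes = floats_to_pcm16(sine_tone())
# Inside an active voice session this buffer could then be sent with:
#     await client.send_audio(audio_bytes)
```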
Speech-to-Text Events
As the user speaks, the server streams back transcription results:
async def on_stt(response):
    if response.is_final:
        print("User said:", response.text)
    else:
        # Interim transcription -- useful for showing live captions
        print("Hearing:", response.text)
client.on("stt_response", on_stt)
After a final stt_response, the server automatically triggers the AI pipeline. The response flows through the same bot_response and bot_voice events used by text chat.
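For live captions, one possible approach is to keep finalized utterances and overwrite the current interim text on every event. The LiveCaption class below is a hypothetical helper, not part of the SDK, and it assumes each interim result replaces the previous one rather than extending it. SimpleNamespace objects stand in for real stt_response payloads, using only the is_final and text fields shown above.

```python
from dataclasses import dataclass, field
from types import SimpleNamespace


@dataclass
class LiveCaption:
    """Accumulates final transcripts and tracks the latest interim text."""
    finals: list = field(default_factory=list)
    interim: str = ""

    def update(self, response) -> str:
        if response.is_final:
            # A final result closes the utterance and clears the interim text.
            self.finals.append(response.text)
            self.interim = ""
        else:
            # Assumption: an interim result supersedes the previous interim.
            self.interim = response.text
        return self.render()

    def render(self) -> str:
        parts = self.finals + ([self.interim] if self.interim else [])
        return " ".join(parts)


caption = LiveCaption()
events = [("hel", False), ("hello", False), ("hello there", True), ("how", False)]
for text, is_final in events:
    line = caption.update(SimpleNamespace(text=text, is_final=is_final))
    print(line)
```

In a real handler, on_stt could call caption.update(response) and redraw the caption line with the returned string.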
Bot Voice Audio
Voice audio arrives as base64-encoded PCM chunks via the bot_voice event:
async def on_voice(voice):
    print(f"Audio chunk {voice.chunk_index} for message {voice.message_id}")
    if voice.is_final:
        print("Last audio chunk received")
client.on("bot_voice", on_voice)
If you installed the audio extra (pip install estuary-sdk[audio]), you can use the built-in AudioPlayer for playback. Otherwise, decoding and playing the audio are your responsibility.
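If you handle decoding yourself, a rough sketch of reassembling a message's audio might look like the following. VoiceChunkCollector is a hypothetical helper, and the attribute name audio for the base64 payload is an assumption, since the payload field is not named above; chunk_index, message_id, and is_final are the fields shown in the handler. The sketch also assumes the final chunk arrives last.

```python
import base64
from collections import defaultdict
from types import SimpleNamespace


class VoiceChunkCollector:
    """Groups decoded PCM chunks by message and joins them in index order."""

    def __init__(self):
        self._chunks = defaultdict(dict)  # message_id -> {chunk_index: pcm bytes}

    def add(self, voice):
        """Store one chunk; return the full PCM bytes on the final chunk, else None."""
        pcm = base64.b64decode(voice.audio)  # `audio` is an assumed attribute name
        self._chunks[voice.message_id][voice.chunk_index] = pcm
        if not voice.is_final:
            return None
        parts = self._chunks.pop(voice.message_id)
        return b"".join(parts[i] for i in sorted(parts))


def fake_chunk(message_id, index, payload, is_final=False):
    """Build a stand-in for a bot_voice event (for illustration only)."""
    encoded = base64.b64encode(payload).decode("ascii")
    return SimpleNamespace(message_id=message_id, chunk_index=index,
                           audio=encoded, is_final=is_final)


collector = VoiceChunkCollector()
first = collector.add(fake_chunk("m1", 0, b"\x01\x00"))
full = collector.add(fake_chunk("m1", 1, b"\x02\x00", is_final=True))
```

Buffering a whole message adds latency; a streaming player would instead hand each decoded chunk to the audio device as it arrives.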
Playback Tracking
Notify the server when audio playback completes so that it can track conversation pacing:
client.notify_audio_playback_complete(message_id)
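If you play audio yourself, you need some way to know when playback has finished. One simple estimate is the duration implied by the PCM16 byte count. The helper below is a hypothetical illustration; the sample rate of the bot's audio is not stated here, so the values used are assumptions.

```python
def pcm16_duration_seconds(num_bytes: int, sample_rate: int, channels: int = 1) -> float:
    """Return the playback time of a PCM16 buffer (2 bytes per sample per channel)."""
    if sample_rate <= 0 or channels <= 0:
        raise ValueError("sample_rate and channels must be positive")
    bytes_per_second = sample_rate * channels * 2
    return num_bytes / bytes_per_second


# 32000 bytes of mono PCM16 at an assumed 16000 Hz is one second of audio.
print(pcm16_duration_seconds(32000, 16000))  # 1.0
```

A player callback that fires when the output buffer drains is more reliable than a timer, where the audio backend offers one.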
Stopping Voice
await client.stop_voice()
This stops audio streaming, cleans up the voice manager, and emits a voice_stopped event. The client remains connected for text chat.
Muting
Toggle the microphone without stopping the voice session:
client.toggle_mute()
print("Muted:", client.is_muted)
When muted, audio streaming is paused but the voice session remains active. The character's voice responses continue.
Voice Lifecycle Events
async def on_voice_started():
    print("Voice session started")
async def on_voice_stopped():
    print("Voice session stopped")
client.on("voice_started", on_voice_started)
client.on("voice_stopped", on_voice_stopped)
Example: Voice Chat
import asyncio
from estuary_sdk import EstuaryClient, EstuaryConfig
config = EstuaryConfig(
    server_url="https://api.estuary-ai.com",
    api_key="est_your_api_key",
    character_id="your-character-uuid",
    player_id="user-123",
    voice_transport="websocket",
)
async def main():
    async with EstuaryClient(config) as client:
        async def on_stt(response):
            if response.is_final:
                print("You:", response.text)
        async def on_response(response):
            if response.is_final:
                print("Bot:", response.text)
        async def on_error(err):
            print("Error:", err)
        client.on("stt_response", on_stt)
        client.on("bot_response", on_response)
        client.on("error", on_error)
        await client.connect()
        print("Connected! Starting voice...")
        await client.start_voice()
        print("Listening -- speak into your microphone.")
        # Keep running
        await asyncio.Event().wait()
asyncio.run(main())
VoiceSession
VoiceSession is a high-level wrapper that manages the AudioPlayer, AudioRecorder, and all event wiring as an async context manager. It handles:
- Creating and disposing the audio player and recorder
- Wiring bot_voice and audio_received events to the player
- Sending audio_playback_complete notifications automatically
- Starting and stopping voice on the client
Continuous Mode
import asyncio
from estuary_sdk import EstuaryClient, EstuaryConfig, VoiceSession, VoiceMode
# config is an EstuaryConfig, built as in the earlier examples
async def main():
    async with EstuaryClient(config) as client:
        await client.connect()
        async with VoiceSession(client, mode=VoiceMode.CONTINUOUS) as session:
            # Microphone is streaming, speaker plays responses automatically.
            await asyncio.Event().wait()
asyncio.run(main())
Push-to-Talk Mode
async with VoiceSession(client, mode=VoiceMode.PUSH_TO_TALK) as session:
    # User presses button
    await session.start_recording()
    await asyncio.sleep(3)
    # User releases button -- triggers end-of-turn
    await session.stop_recording()
Constructor
VoiceSession(
    client: EstuaryClient,
    mode: VoiceMode = VoiceMode.CONTINUOUS,
    *,
    sample_rate: int = 16000,
    on_state: Callable | None = None,
)
| Parameter | Type | Default | Description |
|---|---|---|---|
| client | EstuaryClient | required | A connected client instance |
| mode | VoiceMode | CONTINUOUS | Voice mode |
| sample_rate | int | 16000 | Audio sample rate in Hz |
| on_state | Callable \| None | None | Optional state change callback |
Methods
| Method | Description |
|---|---|
| start() | Start the voice session (called automatically by async with) |
| stop() | Stop the voice session (called automatically on context exit) |
| start_recording() | Begin recording (PTT mode) |
| stop_recording() | Stop recording and trigger end-of-turn (PTT mode) |
Properties
| Property | Type | Description |
|---|---|---|
| client | EstuaryClient | The underlying client |
| is_started | bool | Whether the session is active |
VoiceSession requires the audio extra: pip install estuary-sdk[audio]
Next Steps
- Voice (LiveKit) -- Lower latency voice with WebRTC
- Text Chat -- Send text alongside voice
- API Reference: EstuaryClient -- Full method reference