Voice (WebSocket)

Add real-time voice conversations using the WebSocket voice transport. The SDK streams audio to the server for speech-to-text, and receives the character's voice response.

How It Works

┌───────────┐    PCM bytes       ┌──────────────┐    stream_audio    ┌────────────────┐
│ Microphone│───────────────────►│ VoiceManager │───────────────────►│ Estuary Server │
└───────────┘                    └──────────────┘                    └───────┬────────┘
                                                                             │
┌───────────┐    decode + play   ┌──────────────┐    bot_voice               │
│  Speaker  │◄───────────────────│ AudioPlayer  │◄───────────────────────────┘
└───────────┘                    └──────────────┘

1. Audio is captured and streamed as PCM16 bytes via stream_audio WebSocket events.
2. The server runs speech-to-text and emits stt_response events with transcription results.
3. Once a final transcription is produced, the server processes it through the AI pipeline.
4. The response arrives as bot_response (text) and bot_voice (audio) events.

Starting Voice

await client.connect()

# Start voice -- continuous mode with VAD-based turn detection
await client.start_voice()

By default, the SDK uses voice_transport="websocket". To be explicit:

from estuary_sdk import EstuaryConfig

config = EstuaryConfig(
    server_url="https://api.estuary-ai.com",
    api_key="est_your_api_key",
    character_id="your-character-uuid",
    player_id="user-123",
    voice_transport="websocket",  # the default, stated explicitly
)

Voice Modes

The SDK supports two voice modes:

from estuary_sdk import VoiceMode

# Continuous mode (default) -- VAD detects when you start/stop speaking
await client.start_voice(VoiceMode.CONTINUOUS)

# Push-to-talk -- you control when audio is sent
await client.start_voice(VoiceMode.PUSH_TO_TALK)

Push-to-Talk

In push-to-talk mode, use start_recording() and stop_recording() to control when audio is streamed:

await client.start_voice(VoiceMode.PUSH_TO_TALK)

# User presses button
await client.start_recording()

# User releases button
await client.stop_recording()  # Triggers end-of-turn
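
If a turn is interrupted by an error or a cancelled task, stop_recording() may never run and the turn stays open. The sketch below wraps one push-to-talk turn in try/finally so the turn is always closed. The helper name `record_turn` and the `hold_seconds` parameter are hypothetical; only start_recording() and stop_recording() come from the SDK as described above.

```python
import asyncio


async def record_turn(client, hold_seconds: float) -> None:
    """Record one push-to-talk turn and always end it, even on error or cancellation.

    `client` is any object exposing async start_recording() / stop_recording().
    """
    await client.start_recording()
    try:
        # Stand-in for "the button is held down"; replace with your own UI signal.
        await asyncio.sleep(hold_seconds)
    finally:
        # Runs on normal exit, on exceptions, and on task cancellation.
        await client.stop_recording()
```

In an application, the sleep would typically be replaced by waiting on a button-release signal.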

Sending Raw Audio

You can send raw PCM16 audio bytes directly, regardless of voice mode:

# Send raw PCM16 audio (16-bit signed integer, mono, at the configured sample rate)
await client.send_audio(audio_bytes)

Note: send_audio() raises EstuaryError with code VOICE_NOT_ACTIVE if start_voice() has not been called.
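
To make the expected byte format concrete, here is a minimal sketch that converts floating-point samples to PCM16 bytes and splits them into fixed-duration frames. The helper names (`float_to_pcm16`, `split_frames`), the 16 kHz sample rate, the 20 ms frame size, and little-endian byte order are all assumptions for illustration, not values taken from the SDK.

```python
import math
import struct


def float_to_pcm16(samples):
    """Convert floats in [-1.0, 1.0] to 16-bit signed mono PCM (little-endian assumed)."""
    clipped = [max(-1.0, min(1.0, s)) for s in samples]
    ints = [int(round(s * 32767)) for s in clipped]
    return struct.pack(f"<{len(ints)}h", *ints)


def split_frames(pcm: bytes, sample_rate: int, frame_ms: int = 20):
    """Yield consecutive frames of frame_ms milliseconds; the last one may be shorter."""
    bytes_per_frame = sample_rate * frame_ms // 1000 * 2  # 2 bytes per 16-bit sample
    for start in range(0, len(pcm), bytes_per_frame):
        yield pcm[start:start + bytes_per_frame]


# One second of a 440 Hz tone at an assumed 16 kHz sample rate.
SAMPLE_RATE = 16000
tone = [0.5 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE) for n in range(SAMPLE_RATE)]
pcm_bytes = float_to_pcm16(tone)
frames = list(split_frames(pcm_bytes, SAMPLE_RATE))
```

Each frame could then be passed to `await client.send_audio(frame)` once start_voice() has been called.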

Speech-to-Text Events

As the user speaks, the server streams back transcription results:

async def on_stt(response):
    if response.is_final:
        print("User said:", response.text)
    else:
        # Interim transcription -- useful for showing live captions
        print("Hearing:", response.text)

client.on("stt_response", on_stt)

After a final stt_response, the server automatically triggers the AI pipeline. The response flows through the same bot_response and bot_voice events used by text chat.
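
For live captions, it can help to keep interim and final results separate, since each interim result replaces the previous one while final results accumulate. The following is a minimal sketch of that bookkeeping. The `CaptionTracker` class is hypothetical, and the demo uses stand-in objects that carry only the `text` and `is_final` attributes shown above.

```python
import asyncio
from types import SimpleNamespace


class CaptionTracker:
    """Keep the latest interim caption and the list of finalized utterances."""

    def __init__(self):
        self.interim = ""
        self.finals = []

    async def on_stt(self, response):
        if response.is_final:
            self.finals.append(response.text)
            self.interim = ""  # the interim text is superseded by the final result
        else:
            self.interim = response.text  # replace, do not append

    def display(self) -> str:
        """Finalized text followed by the in-progress caption, if any."""
        parts = self.finals + ([self.interim + "..."] if self.interim else [])
        return " ".join(parts)


async def _demo():
    tracker = CaptionTracker()
    await tracker.on_stt(SimpleNamespace(text="hello wor", is_final=False))
    interim_snapshot = tracker.display()
    await tracker.on_stt(SimpleNamespace(text="hello world", is_final=True))
    return interim_snapshot, tracker.display()


interim_view, final_view = asyncio.run(_demo())
```

With a real client, the tracker would be registered with `client.on("stt_response", tracker.on_stt)`.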

Bot Voice Audio

Voice audio arrives as base64-encoded PCM chunks via the bot_voice event:

async def on_voice(voice):
    print(f"Audio chunk {voice.chunk_index} for message {voice.message_id}")
    if voice.is_final:
        print("Last audio chunk received")

client.on("bot_voice", on_voice)

If you installed the audio extra, you can use the built-in AudioPlayer for playback. Otherwise, audio decoding and playback are your responsibility.
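
If you handle decoding yourself, the sketch below shows one way to turn base64 chunks into a playable WAV file using only the standard library. It assumes the chunks are 16-bit mono PCM at 24000 Hz (the VoiceSession default sample rate) and takes the base64 string directly, because the name of the attribute that carries the audio payload is not specified here. `decode_chunk` and `pcm_to_wav` are hypothetical helper names.

```python
import base64
import io
import wave


def decode_chunk(b64_audio: str) -> bytes:
    """Decode one base64-encoded chunk into raw PCM bytes."""
    return base64.b64decode(b64_audio)


def pcm_to_wav(chunks, sample_rate: int = 24000) -> bytes:
    """Join (chunk_index, pcm_bytes) pairs in index order and wrap them in a WAV container."""
    ordered = b"".join(pcm for _, pcm in sorted(chunks, key=lambda c: c[0]))
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)            # mono (assumed)
        wav.setsampwidth(2)            # 2 bytes per sample, i.e. PCM16 (assumed)
        wav.setframerate(sample_rate)  # must match the rate the server sends
        wav.writeframes(ordered)
    return buf.getvalue()
```

A bot_voice handler could collect `(voice.chunk_index, decode_chunk(payload))` pairs per message_id and call `pcm_to_wav` once `voice.is_final` is true.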

Playback Tracking

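Notify the server when playback of a response finishes so that it can track conversation pacing. Pass the message_id of the response that finished playing, such as the message_id carried by its bot_voice chunks: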

client.notify_audio_playback_complete(message_id)
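
If your playback backend does not report completion, one rough option is to estimate how long the audio lasts from its size. The sketch below computes that duration, assuming uncompressed 16-bit PCM. The function name and the 24000 Hz mono defaults are assumptions, so adjust them to the format you actually receive.

```python
def pcm16_duration_seconds(num_bytes: int, sample_rate: int = 24000, channels: int = 1) -> float:
    """Duration of raw 16-bit PCM audio: bytes / (2 bytes * channels * samples per second)."""
    return num_bytes / (2 * channels * sample_rate)
```

Summing this over all chunks of a message gives an approximate point at which to send the completion notification.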

Stopping Voice

await client.stop_voice()

This stops audio streaming, cleans up the voice manager, and emits a voice_stopped event. The client remains connected for text chat.

Muting

Toggle the microphone without stopping the voice session:

client.toggle_mute()
print("Muted:", client.is_muted)

When muted, audio streaming is paused but the voice session remains active. The character's voice responses continue.
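
Because toggle_mute() flips the current state, code that needs a specific state (for example, "mute while a dialog is open") has to check is_muted first. A minimal sketch of such a helper follows. `set_muted` is a hypothetical name, not an SDK function.

```python
def set_muted(client, muted: bool) -> bool:
    """Put the microphone in the requested state, toggling only if it differs.

    Returns the resulting value of client.is_muted.
    """
    if client.is_muted != muted:
        client.toggle_mute()
    return client.is_muted
```

Calling `set_muted(client, True)` twice in a row leaves the microphone muted rather than flipping it back.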

Voice Lifecycle Events

async def on_voice_started():
    print("Voice session started")

async def on_voice_stopped():
    print("Voice session stopped")

client.on("voice_started", on_voice_started)
client.on("voice_stopped", on_voice_stopped)
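
Lifecycle events can also be bridged to an asyncio.Event, which lets other coroutines wait until the session has actually started or stopped. The sketch below registers such a handler. `event_flag` is a hypothetical helper that relies only on the `client.on(name, handler)` registration shown above, and it should be called from inside a running event loop.

```python
import asyncio


def event_flag(client, event_name: str) -> asyncio.Event:
    """Register a handler that sets an asyncio.Event the first time event_name fires."""
    flag = asyncio.Event()

    async def _handler(*args):
        # Accept and ignore any payload so the helper works for events with arguments.
        flag.set()

    client.on(event_name, _handler)
    return flag
```

A caller might create `started = event_flag(client, "voice_started")` before start_voice() and then `await asyncio.wait_for(started.wait(), timeout=5)` to fail fast if the session never starts.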

Example: Voice Chat

import asyncio
from estuary_sdk import EstuaryClient, EstuaryConfig

config = EstuaryConfig(
    server_url="https://api.estuary-ai.com",
    api_key="est_your_api_key",
    character_id="your-character-uuid",
    player_id="user-123",
    voice_transport="websocket",
)

async def main():
    async with EstuaryClient(config) as client:
        async def on_stt(response):
            if response.is_final:
                print("You:", response.text)

        async def on_response(response):
            if response.is_final:
                print("Bot:", response.text)

        async def on_error(err):
            print("Error:", err)

        client.on("stt_response", on_stt)
        client.on("bot_response", on_response)
        client.on("error", on_error)

        await client.connect()
        print("Connected! Starting voice...")

        await client.start_voice()
        print("Listening -- speak into your microphone.")

        # Keep running
        await asyncio.Event().wait()

asyncio.run(main())

VoiceSession

VoiceSession is a high-level wrapper that manages the AudioPlayer, AudioRecorder, and all event wiring as an async context manager. It handles:

- Creating and disposing the audio player and recorder
- Wiring bot_voice and audio_received events to the player
- Sending audio_playback_complete notifications automatically
- Starting and stopping voice on the client

Continuous Mode

import asyncio
from estuary_sdk import EstuaryClient, EstuaryConfig, VoiceSession, VoiceMode

# config is an EstuaryConfig, built as in the voice chat example above
async def main():
    async with EstuaryClient(config) as client:
        await client.connect()

        async with VoiceSession(client, mode=VoiceMode.CONTINUOUS) as session:
            # Microphone is streaming, speaker plays responses automatically.
            # Block until the program is interrupted.
            await asyncio.Event().wait()

asyncio.run(main())

Push-to-Talk Mode

async with VoiceSession(client, mode=VoiceMode.PUSH_TO_TALK) as session:
    # User presses button
    await session.start_recording()

    await asyncio.sleep(3)

    # User releases button -- triggers end-of-turn
    await session.stop_recording()

Constructor

VoiceSession(
    client: EstuaryClient,
    mode: VoiceMode = VoiceMode.CONTINUOUS,
    *,
    sample_rate: int = 24000,
    on_playback_event: Callable[[str, str | None], None] | None = None,
    volume: float = 1.0,
)

Parameter	Type	Default	Description
`client`	`EstuaryClient`	required	A connected client instance
`mode`	`VoiceMode`	`CONTINUOUS`	Voice mode
`sample_rate`	`int`	`24000`	Audio sample rate in Hz
`on_playback_event`	`Callable[[str, str | None], None] | None`	`None`	Optional playback event callback
`volume`	`float`	`1.0`	Output volume (0.0-1.0)

Methods

Method	Description
`start()`	Start the voice session (called automatically by `async with`)
`stop()`	Stop the voice session (called automatically on context exit)
`start_recording()`	Begin recording (PTT mode)
`stop_recording()`	Stop recording and trigger end-of-turn (PTT mode)

Properties

Property	Type	Description
`client`	`EstuaryClient`	The underlying client
`is_started`	`bool`	Whether the session is active

Note: VoiceSession requires the audio extra: pip install "estuary-sdk[audio]"

Next Steps

Voice (LiveKit) -- Lower latency voice with WebRTC
Text Chat -- Send text alongside voice
API Reference: EstuaryClient -- Full method reference
