Voice Connection

The Estuary Unity SDK supports two voice transports: LiveKit (WebRTC) and WebSocket. LiveKit is recommended for production use because it provides lower latency and native acoustic echo cancellation (AEC).

Voice Modes

Mode      | Transport | Latency | AEC                 | Setup
----------|-----------|---------|---------------------|------------------------
LiveKit   | WebRTC    | Low     | Native platform AEC | Requires LiveKit server
WebSocket | Socket.IO | Higher  | Manual/none         | Works everywhere

Set the mode in your EstuaryConfig asset or at runtime:

EstuaryManager.Instance.Config.VoiceMode = VoiceMode.LiveKit;

WebSocket Voice

How It Works

  1. Client captures audio via Unity's Microphone API
  2. Audio is converted to PCM 16-bit, base64-encoded, and sent as stream_audio events
  3. Server runs STT (Deepgram) on the audio stream
  4. Server sends back stt_response events with transcription
  5. On final transcription, server generates a response and streams bot_response + bot_voice events
  6. Client decodes base64 audio and plays via AudioSource
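Steps 1–2 above amount to converting Unity's float samples into little-endian 16-bit PCM and base64-encoding the result. A minimal sketch of that conversion (the helper name EncodeChunk is illustrative, not part of the SDK):

// Sketch only: converts one chunk of Unity microphone samples (floats in [-1, 1])
// to little-endian 16-bit PCM, then base64, ready to send as a stream_audio payload.
static string EncodeChunk(float[] samples)
{
    var pcm = new byte[samples.Length * 2];
    for (int i = 0; i < samples.Length; i++)
    {
        short s = (short)(Mathf.Clamp(samples[i], -1f, 1f) * short.MaxValue);
        pcm[i * 2] = (byte)(s & 0xFF);          // low byte first (little-endian)
        pcm[i * 2 + 1] = (byte)((s >> 8) & 0xFF);
    }
    return System.Convert.ToBase64String(pcm);
}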

Setup

Add EstuaryMicrophone and EstuaryAudioSource components to your character:

// EstuaryMicrophone settings
sampleRate: 16000 // Required for STT
chunkDurationMs: 100 // 100ms audio chunks
pushToTalkKey: KeyCode.None // Always-on (or set a key)

Starting a Voice Session

// Automatically (on the EstuaryCharacter component)
autoStartVoiceSession = true;

// Or manually
character.StartVoiceSession();

StartVoiceSession() sends start_voice to the server (enabling Deepgram STT) and starts the microphone. When you call EndVoiceSession(), it sends stop_voice and stops the mic.
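For example, a simple UI toggle built on the lifecycle just described (the voiceActive flag and button handler are illustrative, not SDK members):

// Toggle the voice session from a UI button.
bool voiceActive;

void OnVoiceButtonPressed()
{
    if (voiceActive)
        character.EndVoiceSession();   // sends stop_voice, stops the mic
    else
        character.StartVoiceSession(); // sends start_voice, starts the mic
    voiceActive = !voiceActive;
}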

STT Responses

character.OnTranscript += (SttResponse stt) =>
{
    if (stt.IsFinal)
    {
        Debug.Log($"User said: {stt.Text}");
    }
    else
    {
        // Interim result (updates in real time as the user speaks)
        transcriptLabel.text = stt.Text;
    }
};

LiveKit Voice

How It Works

  1. After Socket.IO session is established, SDK requests a LiveKit token
  2. Server returns a JWT token, LiveKit URL, and room name
  3. SDK connects to the LiveKit room using the LiveKit Unity SDK
  4. SDK notifies the server it has joined (livekit_join)
  5. Server joins the bot to the same room (livekit_ready)
  6. Audio flows bidirectionally over WebRTC:
    • Client publishes a microphone track (with native AEC)
    • Client receives the bot's TTS audio track

Setup

LiveKit mode uses the same EstuaryMicrophone and EstuaryAudioSource components, but the audio transport is different:

  • Microphone: Uses native WebRTC capture through RtcAudioSource instead of Unity's Microphone API. This enables proper AEC.
  • Playback: LiveKit's AudioStream handles playback automatically through an AudioSource it creates at runtime.

Connection Flow

// Auto-connect flow (recommended):
// 1. EstuaryCharacter.autoConnect = true
// 2. EstuaryCharacter.autoStartVoiceSession = true
// 3. EstuaryConfig.voiceMode = VoiceMode.LiveKit
//
// The SDK handles the full flow automatically:
// Connect → session_info → request token → join room → livekit_ready → publish mic

// Manual flow:
character.OnConnected += async (session) =>
{
    // Request a LiveKit token for this session
    await EstuaryManager.Instance.RequestLiveKitTokenAsync();
};

EstuaryManager.Instance.OnLiveKitReady += (roomName) =>
{
    Debug.Log($"LiveKit ready in room: {roomName}");
    // Audio is now flowing
};

LiveKit Connection States

Monitor the LiveKit connection via EstuaryManager:

EstuaryManager.Instance.OnLiveKitStateChanged += (state) =>
{
    // LiveKitConnectionState:
    // Disconnected → RequestingToken → Connecting →
    // WaitingForBot → Ready
    statusLabel.text = state.ToString();
};

EstuaryAudioSource

The EstuaryAudioSource component handles TTS playback with a ring buffer for smooth streaming.

WebSocket Mode

In WebSocket mode, audio arrives as base64-encoded PCM chunks in bot_voice events. The component:

  1. Decodes base64 to PCM16 bytes
  2. Converts to float samples
  3. Resamples if the server sample rate differs from Unity's output rate
  4. Writes to a ring buffer
  5. Plays via a streaming AudioClip with a PCM reader callback

The ring buffer provides gap-free playback even when network jitter causes irregular chunk delivery.
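The decode path (steps 1–2 above) is the inverse of the capture encoding: standard little-endian PCM16 to float conversion. A sketch, assuming the helper name DecodeChunk (not an SDK API):

// Sketch only: bot_voice payload → PCM16 bytes → float samples in [-1, 1].
static float[] DecodeChunk(string base64Audio)
{
    byte[] pcm = System.Convert.FromBase64String(base64Audio);
    var samples = new float[pcm.Length / 2];
    for (int i = 0; i < samples.Length; i++)
    {
        // Reassemble little-endian 16-bit sample, then normalize
        short s = (short)(pcm[i * 2] | (pcm[i * 2 + 1] << 8));
        samples[i] = s / 32768f;
    }
    return samples;
}

The resulting floats would then be resampled if needed (step 3) and written into the ring buffer (step 4).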

LiveKit Mode

In LiveKit mode, EstuaryAudioSource defers to LiveKit's AudioStream, which handles playback automatically. The component tracks message_id for interrupt handling.

Interrupts

When the user speaks while the bot is talking, the audio source can automatically interrupt:

// In EstuaryAudioSource Inspector:
autoInterruptOnUserSpeech = true;
microphoneRef = <your EstuaryMicrophone>

This stops playback, clears the audio buffer, and notifies the server to stop generating audio.

Events

audioSource.OnPlaybackStarted += () => { lipSync.Enable(); };
audioSource.OnPlaybackComplete += () => { lipSync.Disable(); };
audioSource.OnPlaybackInterrupted += () => { lipSync.Disable(); };

EstuaryMicrophone

Voice Activity Detection (VAD)

WebSocket mode supports client-side VAD to avoid sending silence:

// In Inspector:
useVoiceActivityDetection = true;
vadThreshold = 0.5f; // 0-1, higher = less sensitive

In LiveKit mode, VAD runs server-side via Deepgram. You can still enable client-side VAD for local speech detection (used for interrupt triggering).
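Client-side VAD of this kind is typically an RMS energy gate over each audio chunk; a minimal sketch of the idea (this illustrates the vadThreshold concept, not the component's exact algorithm):

// Minimal RMS energy gate: returns true if the chunk is loud enough to count as speech.
static bool IsSpeech(float[] samples, float threshold)
{
    float sum = 0f;
    foreach (float s in samples)
        sum += s * s;
    float rms = Mathf.Sqrt(sum / samples.Length);
    return rms > threshold; // higher threshold = less sensitive
}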

Push-to-Talk

Set a key for push-to-talk mode:

// In Inspector:
pushToTalkKey = KeyCode.Space;

In LiveKit mode, holding the key unmutes the mic; releasing it mutes.
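Conceptually, push-to-talk reduces to a per-frame key check that gates the mic; a sketch of that behavior (SetMicMuted is a hypothetical stand-in for the component's internal mute logic):

// Illustrative per-frame push-to-talk gate.
void Update()
{
    if (pushToTalkKey == KeyCode.None) return; // always-on mode
    bool held = Input.GetKey(pushToTalkKey);
    SetMicMuted(!held); // hold to talk, release to mute
}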

Events

microphone.OnRecordingStarted += () => { micIcon.SetActive(true); };
microphone.OnRecordingStopped += () => { micIcon.SetActive(false); };
microphone.OnVolumeChanged += (float vol) => { volumeMeter.value = vol; };
microphone.OnSpeechDetected += () => { /* user started talking */ };
microphone.OnSilenceDetected += () => { /* user stopped talking */ };

Lip Sync Integration

In LiveKit mode, the bot's AudioSource is created at runtime when the bot joins the room. To integrate lip sync:

EstuaryManager.Instance.OnBotAudioSourceCreated += (AudioSource source) =>
{
    // Attach your lip sync analyzer to this AudioSource
    lipSyncAnalyzer.SetAudioSource(source);
};

Next Steps