Voice Connection

The Estuary Unity SDK supports two voice transports: LiveKit (WebRTC) and WebSocket. LiveKit is recommended for production use because it provides lower latency and native acoustic echo cancellation (AEC).

Voice Modes

Mode      | Transport | Latency | AEC                 | Setup
----------|-----------|---------|---------------------|------------------------
LiveKit   | WebRTC    | Low     | Native platform AEC | Requires LiveKit server
WebSocket | Socket.IO | Higher  | Manual/none         | Works everywhere

Set the mode in your EstuaryConfig asset or at runtime:

EstuaryManager.Instance.Config.VoiceMode = VoiceMode.LiveKit;

WebSocket Voice

How It Works

  1. Client captures audio via Unity's Microphone API
  2. Audio is converted to PCM 16-bit, base64-encoded, and sent as stream_audio events
  3. Server runs STT (Deepgram) on the audio stream
  4. Server sends back stt_response events with transcription
  5. On final transcription, server generates a response and streams bot_response + bot_voice events
  6. Client decodes base64 audio and plays via AudioSource
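Steps 1–2 above amount to converting Unity's float samples into little-endian 16-bit PCM and base64-encoding the result. A minimal sketch of that conversion (the helper name EncodeChunk is illustrative, not part of the SDK):

// Sketch only: converts one chunk of Unity microphone samples (floats in [-1, 1])
// to little-endian 16-bit PCM, then base64, ready to send as a stream_audio payload.
static string EncodeChunk(float[] samples)
{
    var pcm = new byte[samples.Length * 2];
    for (int i = 0; i < samples.Length; i++)
    {
        short s = (short)(Mathf.Clamp(samples[i], -1f, 1f) * short.MaxValue);
        pcm[i * 2] = (byte)(s & 0xFF);          // low byte first (little-endian)
        pcm[i * 2 + 1] = (byte)((s >> 8) & 0xFF);
    }
    return System.Convert.ToBase64String(pcm);
}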

Setup

Add EstuaryMicrophone and EstuaryAudioSource components to your character:

// EstuaryMicrophone settings
sampleRate: 16000 // Required for STT
chunkDurationMs: 100 // 100ms audio chunks
pushToTalkKey: KeyCode.None // Always-on (or set a key)

Starting a Voice Session

// Automatically (on the EstuaryCharacter component)
autoStartVoiceSession = true;

// Or manually
character.StartVoiceSession();

StartVoiceSession() sends start_voice to the server (enabling Deepgram STT) and starts the microphone. When you call EndVoiceSession(), it sends stop_voice and stops the mic.
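For example, a simple UI toggle built on the lifecycle just described (the voiceActive flag and button handler are illustrative, not SDK members):

// Toggle the voice session from a UI button.
bool voiceActive;

void OnVoiceButtonPressed()
{
    if (voiceActive)
        character.EndVoiceSession();   // sends stop_voice, stops the mic
    else
        character.StartVoiceSession(); // sends start_voice, starts the mic
    voiceActive = !voiceActive;
}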

STT Responses

character.OnTranscript += (SttResponse stt) =>
{
    if (stt.IsFinal)
    {
        Debug.Log($"User said: {stt.Text}");
    }
    else
    {
        // Interim result (updates in real time as the user speaks)
        transcriptLabel.text = stt.Text;
    }
};

LiveKit Voice

How It Works

  1. After Socket.IO session is established, SDK requests a LiveKit token
  2. Server returns a JWT token, LiveKit URL, and room name
  3. SDK connects to the LiveKit room using the LiveKit Unity SDK
  4. SDK notifies the server it has joined (livekit_join)
  5. Server joins the bot to the same room (livekit_ready)
  6. Audio flows bidirectionally over WebRTC:
    • Client publishes a microphone track (with native AEC)
    • Client receives the bot's TTS audio track

Setup

LiveKit mode uses the same EstuaryMicrophone and EstuaryAudioSource components, but the audio transport is different:

  • Microphone: Uses native WebRTC capture through RtcAudioSource instead of Unity's Microphone API. This enables proper AEC.
  • Playback: LiveKit's AudioStream handles playback automatically through an AudioSource it creates at runtime.

Connection Flow

// Auto-connect flow (recommended):
// 1. EstuaryCharacter.autoConnect = true
// 2. EstuaryCharacter.autoStartVoiceSession = true
// 3. EstuaryConfig.voiceMode = VoiceMode.LiveKit
//
// The SDK handles the full flow automatically:
// Connect → session_info → request token → join room → livekit_ready → publish mic

// Manual flow:
character.OnConnected += async (session) =>
{
    // Request a LiveKit token for this session
    await EstuaryManager.Instance.RequestLiveKitTokenAsync();
};

EstuaryManager.Instance.OnLiveKitReady += (roomName) =>
{
    Debug.Log($"LiveKit ready in room: {roomName}");
    // Audio is now flowing
};

LiveKit Connection States

Monitor the LiveKit connection via EstuaryManager:

EstuaryManager.Instance.OnLiveKitStateChanged += (state) =>
{
    // LiveKitConnectionState:
    // Disconnected → RequestingToken → Connecting →
    // WaitingForBot → Ready
    statusLabel.text = state.ToString();
};

EstuaryAudioSource

The EstuaryAudioSource component handles TTS playback with a ring buffer for smooth streaming.

WebSocket Mode

In WebSocket mode, audio arrives as base64-encoded PCM chunks in bot_voice events. The component:

  1. Decodes base64 to PCM16 bytes
  2. Converts to float samples
  3. Resamples if the server sample rate differs from Unity's output rate
  4. Writes to a ring buffer
  5. Plays via a streaming AudioClip with a PCM reader callback

The ring buffer provides gap-free playback even when network jitter causes irregular chunk delivery.
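The decode path (steps 1–2 above) is the inverse of the capture encoding: standard little-endian PCM16 to float conversion. A sketch, assuming the helper name DecodeChunk (not an SDK API):

// Sketch only: bot_voice payload → PCM16 bytes → float samples in [-1, 1].
static float[] DecodeChunk(string base64Audio)
{
    byte[] pcm = System.Convert.FromBase64String(base64Audio);
    var samples = new float[pcm.Length / 2];
    for (int i = 0; i < samples.Length; i++)
    {
        // Reassemble little-endian 16-bit sample, then normalize
        short s = (short)(pcm[i * 2] | (pcm[i * 2 + 1] << 8));
        samples[i] = s / 32768f;
    }
    return samples;
}

The resulting floats would then be resampled if needed (step 3) and written into the ring buffer (step 4).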

LiveKit Mode

In LiveKit mode, EstuaryAudioSource defers to LiveKit's AudioStream, which handles playback automatically. The component tracks message_id for interrupt handling.

Interrupts

When the user speaks while the bot is talking, the audio source can automatically interrupt:

// In EstuaryAudioSource Inspector:
autoInterruptOnUserSpeech = true;
microphoneRef = <your EstuaryMicrophone>

This stops playback, clears the audio buffer, and notifies the server to stop generating audio.

Events

audioSource.OnPlaybackStarted += () => { lipSync.Enable(); };
audioSource.OnPlaybackComplete += () => { lipSync.Disable(); };
audioSource.OnPlaybackInterrupted += () => { lipSync.Disable(); };

EstuaryMicrophone

Voice Activity Detection (VAD)

WebSocket mode supports client-side VAD to avoid sending silence:

// In Inspector:
useVoiceActivityDetection = true;
vadThreshold = 0.5f; // 0-1, higher = less sensitive

In LiveKit mode, VAD runs server-side via Deepgram. You can still enable client-side VAD for local speech detection (used for interrupt triggering).
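Client-side VAD of this kind is typically an RMS energy gate over each audio chunk; a minimal sketch of the idea (this illustrates the vadThreshold concept, not the component's exact algorithm):

// Minimal RMS energy gate: returns true if the chunk is loud enough to count as speech.
static bool IsSpeech(float[] samples, float threshold)
{
    float sum = 0f;
    foreach (float s in samples)
        sum += s * s;
    float rms = Mathf.Sqrt(sum / samples.Length);
    return rms > threshold; // higher threshold = less sensitive
}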

Push-to-Talk

Set a key for push-to-talk mode:

// In Inspector:
pushToTalkKey = KeyCode.Space;

In LiveKit mode, holding the key unmutes the mic; releasing it mutes.
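Conceptually, push-to-talk reduces to a per-frame key check that gates the mic; a sketch of that behavior (SetMicMuted is a hypothetical stand-in for the component's internal mute logic):

// Illustrative per-frame push-to-talk gate.
void Update()
{
    if (pushToTalkKey == KeyCode.None) return; // always-on mode
    bool held = Input.GetKey(pushToTalkKey);
    SetMicMuted(!held); // hold to talk, release to mute
}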

Events

microphone.OnRecordingStarted += () => { micIcon.SetActive(true); };
microphone.OnRecordingStopped += () => { micIcon.SetActive(false); };
microphone.OnVolumeChanged += (float vol) => { volumeMeter.value = vol; };
microphone.OnSpeechDetected += () => { /* user started talking */ };
microphone.OnSilenceDetected += () => { /* user stopped talking */ };

Lip Sync Integration

In LiveKit mode, the bot's AudioSource is created at runtime when the bot joins the room. To integrate lip sync:

EstuaryManager.Instance.OnBotAudioSourceCreated += (AudioSource source) =>
{
    // Attach your lip sync analyzer to this AudioSource
    lipSyncAnalyzer.SetAudioSource(source);
};

Next Steps