# Voice Connection

The Estuary Unity SDK supports two voice transports: LiveKit (WebRTC) and WebSocket. LiveKit is recommended for production use because it provides lower latency and native acoustic echo cancellation (AEC).
## Voice Modes

| Mode | Transport | Latency | AEC | Setup |
|---|---|---|---|---|
| LiveKit | WebRTC | Low | Native platform AEC | Requires a LiveKit server |
| WebSocket | Socket.IO | Higher | Manual/none | Works everywhere |

Set the mode in your `EstuaryConfig` asset or at runtime:

```csharp
EstuaryManager.Instance.Config.VoiceMode = VoiceMode.LiveKit;
```
## WebSocket Voice

### How It Works

- Client captures audio via Unity's `Microphone` API
- Audio is converted to 16-bit PCM, base64-encoded, and sent as `stream_audio` events
- Server runs STT (Deepgram) on the audio stream
- Server sends back `stt_response` events with transcriptions
- On a final transcription, the server generates a response and streams `bot_response` + `bot_voice` events
- Client decodes the base64 audio and plays it via `AudioSource`
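The capture-and-encode step above is easy to reason about in isolation. Here is a minimal, language-agnostic sketch (Python for illustration; the SDK does the equivalent in C#) of turning float audio samples into the kind of base64 PCM payload a `stream_audio` event carries:

```python
import base64
import struct

def encode_chunk(samples):
    """Float samples in [-1, 1] -> little-endian 16-bit PCM -> base64 string."""
    ints = [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]
    pcm16 = struct.pack("<%dh" % len(ints), *ints)
    return base64.b64encode(pcm16).decode("ascii")

# A 100 ms chunk at 16 kHz is 1600 samples, i.e. 3200 bytes of PCM16 before encoding.
```

The surrounding event envelope (field names, message framing) is defined by the server protocol and is not shown here.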
### Setup

Add `EstuaryMicrophone` and `EstuaryAudioSource` components to your character:

```csharp
// EstuaryMicrophone settings
sampleRate: 16000            // required for STT
chunkDurationMs: 100         // 100 ms audio chunks
pushToTalkKey: KeyCode.None  // always-on (or set a key)
```
### Starting a Voice Session

```csharp
// Automatically (on the EstuaryCharacter component)
autoStartVoiceSession = true;

// Or manually
character.StartVoiceSession();
```

`StartVoiceSession()` sends `start_voice` to the server (enabling Deepgram STT) and starts the microphone. When you call `EndVoiceSession()`, it sends `stop_voice` and stops the mic.
### STT Responses

```csharp
character.OnTranscript += (SttResponse stt) =>
{
    if (stt.IsFinal)
    {
        Debug.Log($"User said: {stt.Text}");
    }
    else
    {
        // Interim result, updated in real time as the user speaks
        transcriptLabel.text = stt.Text;
    }
};
```
## LiveKit Voice

### How It Works

- After the Socket.IO session is established, the SDK requests a LiveKit token
- Server returns a JWT token, the LiveKit URL, and a room name
- SDK connects to the LiveKit room using the LiveKit Unity SDK
- SDK notifies the server that it has joined (`livekit_join`)
- Server joins the bot to the same room (`livekit_ready`)
- Audio flows bidirectionally over WebRTC:
  - Client publishes a microphone track (with native AEC)
  - Client receives the bot's TTS audio track
### Setup

LiveKit mode uses the same `EstuaryMicrophone` and `EstuaryAudioSource` components, but the audio transport is different:

- Microphone: uses native WebRTC capture through `RtcAudioSource` instead of Unity's `Microphone` API. This enables proper AEC.
- Playback: LiveKit's `AudioStream` handles playback automatically through an `AudioSource` it creates at runtime.
### Connection Flow

```csharp
// Auto-connect flow (recommended):
// 1. EstuaryCharacter.autoConnect = true
// 2. EstuaryCharacter.autoStartVoiceSession = true
// 3. EstuaryConfig.voiceMode = VoiceMode.LiveKit
//
// The SDK handles the full flow automatically:
// Connect → session_info → request token → join room → livekit_ready → publish mic

// Manual flow:
character.OnConnected += async (session) =>
{
    // Request a LiveKit token
    await EstuaryManager.Instance.RequestLiveKitTokenAsync();
};

EstuaryManager.Instance.OnLiveKitReady += (roomName) =>
{
    Debug.Log($"LiveKit ready in room: {roomName}");
    // Audio is now flowing
};
```
### LiveKit Connection States

Monitor the LiveKit connection via `EstuaryManager`:

```csharp
EstuaryManager.Instance.OnLiveKitStateChanged += (state) =>
{
    // LiveKitConnectionState:
    // Disconnected → RequestingToken → Connecting →
    // WaitingForBot → Ready
    statusLabel.text = state.ToString();
};
```
## EstuaryAudioSource

The `EstuaryAudioSource` component handles TTS playback with a ring buffer for smooth streaming.
### WebSocket Mode

In WebSocket mode, audio arrives as base64-encoded PCM chunks in `bot_voice` events. The component:

- Decodes the base64 payload to PCM16 bytes
- Converts the samples to floats
- Resamples if the server sample rate differs from Unity's output rate
- Writes the samples to a ring buffer
- Plays via a streaming `AudioClip` with a PCM reader callback

The ring buffer provides gap-free playback even when network jitter causes irregular chunk delivery.
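The jitter tolerance comes from a simple policy: writes append whatever chunks the network delivers, and the reader callback always fills its request, padding with silence on underrun instead of stalling. A minimal sketch of that policy (illustrative Python; the SDK's C# implementation will differ in detail):

```python
class RingBuffer:
    """Minimal single-threaded ring buffer for streaming audio samples."""

    def __init__(self, capacity):
        self.buf = [0.0] * capacity
        self.capacity = capacity
        self.read_pos = 0
        self.write_pos = 0
        self.count = 0  # samples currently buffered

    def write(self, samples):
        for s in samples:
            if self.count == self.capacity:
                break  # full: drop the remainder (one possible overflow policy)
            self.buf[self.write_pos] = s
            self.write_pos = (self.write_pos + 1) % self.capacity
            self.count += 1

    def read(self, n):
        """Always returns n samples; pads with silence on underrun so playback never gaps."""
        out = []
        for _ in range(n):
            if self.count == 0:
                out.append(0.0)  # underrun: emit silence rather than stall the audio thread
            else:
                out.append(self.buf[self.read_pos])
                self.read_pos = (self.read_pos + 1) % self.capacity
                self.count -= 1
        return out
```

In the real component the reader runs on Unity's audio thread, so the actual implementation also needs thread-safe bookkeeping, which is omitted here.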
### LiveKit Mode

In LiveKit mode, `EstuaryAudioSource` defers to LiveKit's `AudioStream`, which handles playback automatically. The component tracks `message_id` for interrupt handling.
### Interrupts

When the user speaks while the bot is talking, the audio source can interrupt automatically:

```csharp
// In the EstuaryAudioSource Inspector:
autoInterruptOnUserSpeech = true;
microphoneRef = <your EstuaryMicrophone>
```
This stops playback, clears the audio buffer, and notifies the server to stop generating audio.
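The decision logic behind auto-interrupt is a small piece of glue between the microphone's speech detection and the audio source. A hedged sketch (Python for illustration; `notify_server_stop` is a hypothetical stand-in for the SDK's internal server notification, whose event name is not specified here):

```python
def handle_user_speech(bot_is_speaking, auto_interrupt,
                       stop_playback, clear_buffer, notify_server_stop):
    """Called when the microphone detects user speech (cf. OnSpeechDetected)."""
    if not (auto_interrupt and bot_is_speaking):
        return False  # nothing to interrupt
    stop_playback()       # halt playback immediately
    clear_buffer()        # discard any queued TTS samples
    notify_server_stop()  # hypothetical hook: ask the server to stop generating audio
    return True
```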
### Events

```csharp
audioSource.OnPlaybackStarted += () => { lipSync.Enable(); };
audioSource.OnPlaybackComplete += () => { lipSync.Disable(); };
audioSource.OnPlaybackInterrupted += () => { lipSync.Disable(); };
```
## EstuaryMicrophone

### Voice Activity Detection (VAD)

WebSocket mode supports client-side VAD to avoid sending silence:

```csharp
// In the Inspector:
useVoiceActivityDetection = true;
vadThreshold = 0.5f; // 0-1, higher = less sensitive
```
In LiveKit mode, VAD runs server-side via Deepgram. You can still enable client-side VAD for local speech detection (used for interrupt triggering).
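A threshold-based VAD of the kind the `vadThreshold` setting suggests typically compares a chunk's RMS energy against the threshold; a higher threshold requires louder input, hence "less sensitive". An illustrative sketch (the SDK's actual detector is not specified, so treat this as one plausible implementation):

```python
import math

def is_speech(samples, vad_threshold):
    """Treat a chunk as speech when its RMS level (0-1 scale) exceeds the threshold."""
    if not samples:
        return False
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return rms > vad_threshold
```

Real detectors usually add hangover (a short grace period after the level drops) so brief pauses mid-sentence don't register as silence.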
### Push-to-Talk

Set a key to enable push-to-talk mode:

```csharp
// In the Inspector:
pushToTalkKey = KeyCode.Space;
```
In LiveKit mode, holding the key unmutes the mic; releasing it mutes.
### Events

```csharp
microphone.OnRecordingStarted += () => { micIcon.SetActive(true); };
microphone.OnRecordingStopped += () => { micIcon.SetActive(false); };
microphone.OnVolumeChanged += (float vol) => { volumeMeter.value = vol; };
microphone.OnSpeechDetected += () => { /* user started talking */ };
microphone.OnSilenceDetected += () => { /* user stopped talking */ };
```
## Lip Sync Integration

In LiveKit mode, the bot's `AudioSource` is created at runtime when the bot joins the room. To integrate lip sync:

```csharp
EstuaryManager.Instance.OnBotAudioSourceCreated += (AudioSource source) =>
{
    // Attach your lip sync analyzer to this AudioSource
    lipSyncAnalyzer.SetAudioSource(source);
};
```
## Next Steps

- Video Streaming -- Stream camera video for spatial awareness
- API Reference: Input Components -- Full `EstuaryMicrophone` reference
- API Reference: Character Components -- Full `EstuaryAudioSource` reference