Voice Connection

Learn how to implement real-time voice conversations with AI characters using the Estuary SDK.

Overview

The Estuary SDK provides full-duplex voice communication:

  • Voice Input: Capture user speech via microphone → Speech-to-Text (Deepgram)
  • Voice Output: AI responses → Text-to-Speech (ElevenLabs) → Audio playback
┌─────────────────────────────────────────────────────────────────┐
│ Voice Conversation Flow │
├─────────────────────────────────────────────────────────────────┤
│ │
│ USER SPEAKS │
│ ┌──────────┐ ┌─────────────┐ ┌───────────────────────┐ │
│ │ Mic │---→│ SDK Encodes │---→│ Estuary Server │ │
│ │ (16kHz) │ │ (Base64) │ │ ┌─────────────────┐ │ │
│ └──────────┘ └─────────────┘ │ │ Deepgram (STT) │ │ │
│ │ └────────┬────────┘ │ │
│ │ ▼ │ │
│ │ ┌─────────────────┐ │ │
│ AI RESPONDS │ │ AI Character │ │ │
│ ┌──────────┐ ┌─────────────┐ │ └────────┬────────┘ │ │
│ │ Speaker │←---│ SDK Decodes │←---│ ▼ │ │
│ │ (24kHz) │ │ (Base64) │ │ ┌─────────────────┐ │ │
│ └──────────┘ └─────────────┘ │ │ ElevenLabs TTS │ │ │
│ │ └─────────────────┘ │ │
│ └───────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘

Quick Start with SimpleAutoConnect

The easiest way to implement voice is using the SimpleAutoConnect component:

Step 1: Scene Setup

Create these SceneObjects:

Object Name           Components
Estuary Credentials   EstuaryCredentials script
Estuary Connection    SimpleAutoConnect script
Microphone            MicrophoneRecorder script
Audio Output          DynamicAudioOutput script + AudioComponent
Internet Module       InternetModule resource

Step 2: Configure SimpleAutoConnect

In the Inspector, connect:

credentialsObject        → Estuary Credentials
internetModule           → Internet Module
microphoneRecorderObject → Microphone
dynamicAudioOutputObject → Audio Output

That's It!

SimpleAutoConnect handles:

  • Connection management
  • Microphone streaming
  • Voice Activity Detection (server-side via Deepgram)
  • Audio playback
  • Interrupt handling
  • Auto-reconnection
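The auto-reconnection behavior above can be approximated as exponential backoff with a ceiling. The actual retry policy is internal to SimpleAutoConnect, so treat this as an illustrative sketch of the pattern, not the SDK's implementation:

```typescript
// Illustrative only: exponential backoff with a cap, similar in spirit to
// what an auto-reconnecting component might do. The real SimpleAutoConnect
// policy is internal to the SDK.
function reconnectDelayMs(attempt: number, baseMs = 500, maxMs = 8000): number {
    // attempt 0 → 500 ms, attempt 1 → 1000 ms, doubling until capped at maxMs
    return Math.min(baseMs * Math.pow(2, attempt), maxMs);
}
```

Capping the delay keeps recovery responsive after long outages while still avoiding a reconnect storm against the server.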

Manual Voice Implementation

For more control, you can implement voice manually.

Set Up the Character

import { EstuaryCharacter, EstuaryConfig, setInternetModule } from 'estuary-lens-studio-sdk';

@component
export class VoiceController extends BaseScriptComponent {

    @input
    internetModule: InternetModule;

    private character: EstuaryCharacter;

    onAwake() {
        // REQUIRED: Set up InternetModule for WebSocket
        setInternetModule(this.internetModule);

        // Create character
        this.character = new EstuaryCharacter(
            "your-character-id",
            "unique-player-id"
        );

        // Configure and connect
        const config: EstuaryConfig = {
            serverUrl: "wss://api.estuary-ai.com",
            apiKey: "your-api-key",
            characterId: "your-character-id",
            playerId: "unique-player-id",
            debugLogging: true
        };

        this.character.initialize(config);
    }
}

Set Up Microphone Input

import { EstuaryMicrophone, MicrophoneRecorder } from 'estuary-lens-studio-sdk';

@component
export class VoiceController extends BaseScriptComponent {

    @input
    microphoneRecorderObject: SceneObject;

    private microphone: EstuaryMicrophone;

    onAwake() {
        // ... character setup ...

        // Create microphone handler
        this.microphone = new EstuaryMicrophone(this.character);
        this.microphone.debugLogging = true;

        // Find MicrophoneRecorder on the SceneObject
        const recorder = this.findMicrophoneRecorder(this.microphoneRecorderObject);
        if (recorder) {
            this.microphone.setMicrophoneRecorder(recorder);
        }

        // Connect microphone to character
        this.character.microphone = this.microphone;
    }

    private findMicrophoneRecorder(obj: SceneObject): MicrophoneRecorder | null {
        const count = obj.getComponentCount("Component.ScriptComponent");
        for (let i = 0; i < count; i++) {
            const comp = obj.getComponentByIndex("Component.ScriptComponent", i) as any;
            if (comp?.onAudioFrame && typeof comp.startRecording === 'function') {
                return comp as MicrophoneRecorder;
            }
        }
        return null;
    }
}

Handle Voice Events

private setupEventHandlers() {
    // When connected, start voice session
    this.character.on('connected', (session) => {
        print(`Connected: ${session.sessionId}`);

        // IMPORTANT: Start voice session before streaming audio
        this.character.startVoiceSession();

        // Start microphone
        this.microphone.startRecording();
    });

    // Handle transcription (what the user said)
    this.character.on('transcript', (stt) => {
        if (stt.isFinal) {
            print(`[You] ${stt.text}`);
        }
    });

    // Handle AI text response
    this.character.on('botResponse', (response) => {
        if (response.isFinal) {
            print(`[AI] ${response.text}`);
        }
    });

    // Handle AI voice response
    this.character.on('voiceReceived', (voice) => {
        // Play audio via DynamicAudioOutput
        this.playVoiceAudio(voice);
    });

    // Handle interrupts (user speaks while the AI is talking)
    this.character.on('interrupt', () => {
        // Stop current audio playback
        this.stopAudioPlayback();
    });
}

Audio Playback

interface DynamicAudioOutput {
    initialize(sampleRate: number): void;
    addAudioFrame(uint8Array: Uint8Array, channels: number): void;
    interruptAudioOutput(): void;
}

private dynamicAudioOutput: DynamicAudioOutput;
private audioInitialized: boolean = false;

private setupAudioOutput(audioOutputObject: SceneObject) {
    // Find DynamicAudioOutput component
    const count = audioOutputObject.getComponentCount("Component.ScriptComponent");
    for (let i = 0; i < count; i++) {
        const comp = audioOutputObject.getComponentByIndex("Component.ScriptComponent", i) as any;
        if (comp?.initialize && comp?.addAudioFrame) {
            this.dynamicAudioOutput = comp;
            break;
        }
    }

    if (this.dynamicAudioOutput) {
        // Initialize with the TTS sample rate (24 kHz for ElevenLabs)
        this.dynamicAudioOutput.initialize(24000);
        this.audioInitialized = true;
    }
}

private playVoiceAudio(voice: BotVoice) {
    if (!this.dynamicAudioOutput || !voice.audio) return;

    // Decode Base64 to PCM bytes
    const pcmBytes = Base64.decode(voice.audio);

    // Play audio (mono = 1 channel)
    this.dynamicAudioOutput.addAudioFrame(pcmBytes, 1);
}

private stopAudioPlayback() {
    if (this.dynamicAudioOutput) {
        this.dynamicAudioOutput.interruptAudioOutput();
    }
}

Voice Session Management

Starting a Voice Session

You must start a voice session before streaming audio:

character.on('connected', (session) => {
    // This enables audio streaming to the server
    character.startVoiceSession();

    // Now you can start recording
    microphone.startRecording();
});

Ending a Voice Session

// Stop voice input (but keep connection open)
character.endVoiceSession();
microphone.stopRecording();

Voice Session State

// Check if voice session is active
if (character.isVoiceSessionActive) {
// Audio streaming is enabled
}

Handling Interrupts

When the user speaks while the AI is responding, an interrupt is triggered:

character.on('interrupt', (data) => {
    // 1. Stop audio playback immediately
    dynamicAudioOutput.interruptAudioOutput();

    // 2. Clear any pending response text
    // 3. Update UI to show the user is speaking

    print("User interrupted - stopping AI audio");
});

The server automatically:

  • Stops generating the current response
  • Clears the TTS queue
  • Starts processing the new user input
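On the client side, one robust way to honor an interrupt is to tag outgoing playback with a generation counter and drop any frames that were already in flight when the interrupt fired. This is a sketch of the pattern, not SDK code; the class name and methods are hypothetical:

```typescript
// Sketch: a generation counter that invalidates in-flight audio frames
// after an interrupt, so stale AI speech never reaches the speaker.
class PlaybackGate {
    private generation = 0;

    // Call from the 'interrupt' handler: frames tagged with an older
    // generation are silently dropped.
    interrupt(): void {
        this.generation++;
    }

    // Tag a frame with the generation that was current when it arrived.
    currentGeneration(): number {
        return this.generation;
    }

    // Returns true if the frame should still be played.
    shouldPlay(frameGeneration: number): boolean {
        return frameGeneration === this.generation;
    }
}
```

Checking `shouldPlay()` just before calling `addAudioFrame()` guards against frames that were queued before the interrupt but decoded after it.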

Text-Only Fallback

You can also send text messages without voice:

// Send text directly to the AI
character.sendText("Hello, how are you?");

// Listen for text response
character.on('botResponse', (response) => {
    if (response.isFinal) {
        print(`AI responded: ${response.text}`);
    }
});

Audio Format Details

Input Audio (Microphone to Server)

Property      Value
Sample Rate   16,000 Hz
Format        16-bit PCM (signed, little-endian)
Channels      Mono (1)
Encoding      Base64 string
Chunk Size    ~100 ms of audio
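These numbers compose: at 16 kHz, 16-bit mono, 100 ms of audio is 16,000 × 0.1 = 1,600 samples = 3,200 bytes of raw PCM, which Base64 then inflates by roughly 4/3. A plain-TypeScript sketch of the arithmetic (the function name is ours, not the SDK's):

```typescript
// Raw PCM bytes in one microphone chunk, from the format table above.
function pcmChunkBytes(sampleRateHz: number, bitDepth: number,
                       channels: number, chunkMs: number): number {
    const samples = (sampleRateHz * chunkMs) / 1000;
    return samples * (bitDepth / 8) * channels;
}

// 16 kHz, 16-bit, mono, ~100 ms → 3200 bytes of PCM,
// which Base64 expands to ceil(3200 / 3) * 4 = 4268 characters.
const rawBytes = pcmChunkBytes(16000, 16, 1, 100);
const base64Chars = Math.ceil(rawBytes / 3) * 4;
```

Knowing the per-chunk size is useful when budgeting WebSocket bandwidth for the Lens.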

Output Audio (Server to Speaker)

Property      Value
Sample Rate   24,000 Hz (ElevenLabs default)
Format        16-bit PCM (signed, little-endian)
Channels      Mono (1)
Encoding      Base64 string
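Going the other direction, the playback duration of a received voice frame follows from the output format: each 16-bit sample is 2 bytes, so a decoded payload of N bytes is N/2 samples, or (N/2)/24,000 seconds. A small sketch (function name is ours):

```typescript
// Duration in milliseconds of a decoded 16-bit mono PCM payload at the
// given sample rate (24,000 Hz for the ElevenLabs output described above).
function pcmDurationMs(byteLength: number, sampleRateHz: number): number {
    const samples = byteLength / 2; // 16-bit = 2 bytes per sample
    return (samples / sampleRateHz) * 1000;
}

// A 48,000-byte decoded frame at 24 kHz is exactly one second of audio.
const frameMs = pcmDurationMs(48000, 24000);
```

This is handy for estimating how long the speaker will stay busy after the last `voiceReceived` event.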

Troubleshooting

No Audio Being Sent

WARNING: Audio dropped: voice session not active! Call startVoiceSession() first.

Solution: Call character.startVoiceSession() after connection before recording.

No Audio Playback

  1. Ensure DynamicAudioOutput has an AudioComponent attached
  2. Verify an AudioTrack asset is assigned
  3. Check initialize() was called with correct sample rate (24000)

Audio Cuts Out

The SDK includes throttling to prevent WebSocket buffer overflow. If audio still cuts out:

  • Check network stability
  • Reduce other network traffic in your Lens

WebSocket NOT available in Lens Studio Preview

This is expected: WebSocket connections only work on actual Spectacles hardware, not in the Lens Studio Preview. Deploy to a device for testing.


Best Practices

Always Initialize Audio Early

// Initialize audio output immediately after finding the component
dynamicAudioOutput.initialize(24000);

Handle Connection Loss

character.on('disconnected', () => {
    microphone.stopRecording();
    // Show UI indicator
});

character.on('error', (error) => {
    print(`Voice error: ${error}`);
    // Attempt recovery or show a message
});

Provide Visual Feedback

Users benefit from knowing:

  • When their voice is being captured
  • When the AI is "thinking"
  • When the AI is speaking
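One way to drive those indicators is a tiny state machine fed by the events shown earlier. This sketch uses our own event labels (e.g. `transcriptPartial`, `playbackFinished`) purely for illustration; map them to the actual SDK events (`transcript`, `botResponse`, `voiceReceived`, `interrupt`) in your own handlers:

```typescript
// Sketch: conversation state for UI feedback. Event names here are
// illustrative labels, not SDK event names.
type VoiceUIState = 'idle' | 'listening' | 'thinking' | 'speaking';

function nextState(current: VoiceUIState, event: string): VoiceUIState {
    switch (event) {
        case 'transcriptPartial': return 'listening'; // user is talking
        case 'transcriptFinal':   return 'thinking';  // waiting on the AI
        case 'voiceReceived':     return 'speaking';  // AI audio arriving
        case 'playbackFinished':  return 'idle';
        case 'interrupt':         return 'listening'; // user barged in
        default:                  return current;     // ignore unknown events
    }
}
```

Keeping the state transitions in one pure function makes the UI logic easy to test independently of the audio pipeline.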

Respect Voice Sessions

// Don't stream audio without an active session
if (character.isConnected && character.isVoiceSessionActive) {
    character.streamAudio(audioBase64);
}

Next Steps