Conversation Protocol

All real-time communication between SDKs and the Estuary server happens over Socket.IO v4 on the /sdk namespace. This page documents every event in the protocol.

Transport

The protocol runs over Socket.IO v4, which uses WebSocket as its primary transport. Socket.IO adds:

  • Namespace multiplexing (all SDK events use /sdk)
  • Automatic reconnection
  • Event-based messaging with JSON payloads
  • Heartbeat/ping-pong for connection health

Text Flow

Sending Text

The client sends a text event with the message:

Client → Server
Event: text
Payload: { "text": "Hello, how are you?", "textOnly": false }
| Field | Type | Required | Description |
|---|---|---|---|
| text | string | Yes | The message text |
| textOnly | boolean | No | If true, suppress TTS audio in the response. Default: false |
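As a sketch, the payload above can be built with a small helper. The function name build_text_payload is hypothetical; only the field names and the textOnly default come from the table.

```python
def build_text_payload(text, text_only=False):
    """Build the payload for a `text` event (shape per the table above)."""
    payload = {"text": text}
    if text_only:
        # Optional field; the server treats an omitted textOnly as false.
        payload["textOnly"] = True
    return payload
```

A client would then emit this dict as the `text` event on the /sdk namespace.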

Receiving Responses

The server streams the response as multiple bot_response events:

Server → Client
Event: bot_response
Payload: {
"text": "I'm doing",
"is_final": false,
"partial": true,
"message_id": "msg_abc123",
"chunk_index": 0,
"is_interjection": false
}
| Field | Type | Description |
|---|---|---|
| text | string | Text content of this chunk |
| is_final | boolean | true when this is the complete response |
| partial | boolean | true for streaming chunks |
| message_id | string | Unique ID for this response (use for interrupt tracking) |
| chunk_index | number | Sequential index of this chunk |
| is_interjection | boolean | true if this is a proactive message (not a reply to user input) |

Chunks arrive in order. Accumulate text from each chunk until is_final is true to build the complete response.
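The accumulation step can be sketched as a small buffer keyed by message_id. This assumes the final chunk carries the last text fragment (rather than repeating the whole response); adjust if the server behaves differently.

```python
class ResponseAccumulator:
    """Accumulate streamed bot_response chunks into complete messages."""

    def __init__(self):
        self._buffers = {}  # message_id -> list of text chunks

    def on_bot_response(self, event):
        """Feed one bot_response payload; return full text once is_final."""
        msg_id = event["message_id"]
        # Chunks arrive in chunk_index order, so appending preserves order.
        self._buffers.setdefault(msg_id, []).append(event["text"])
        if event.get("is_final"):
            return "".join(self._buffers.pop(msg_id))
        return None
```

Register on_bot_response as the handler for the bot_response event; it returns None for intermediate chunks and the assembled text when the response completes.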

Voice Response

Unless textOnly was set to true, the server also streams TTS audio:

Server → Client
Event: bot_voice
Payload: {
"audio": "<base64-encoded PCM16>",
"message_id": "msg_abc123",
"chunk_index": 0,
"is_final": false
}
| Field | Type | Description |
|---|---|---|
| audio | string | Base64-encoded PCM 16-bit audio |
| message_id | string | Matches the corresponding bot_response message |
| chunk_index | number | Sequential index of this audio chunk |
| is_final | boolean | true when this is the last audio chunk |

The audio sample rate matches the audio_sample_rate provided during authentication.
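Decoding a bot_voice chunk into playable samples is a two-step base64-then-PCM conversion, sketched below with the standard library. The wire format is assumed to be little-endian 16-bit PCM; the function name is illustrative.

```python
import base64
import array

def decode_voice_chunk(event):
    """Decode one bot_voice payload into signed 16-bit PCM samples."""
    raw = base64.b64decode(event["audio"])
    samples = array.array("h")   # signed 16-bit, host byte order
    samples.frombytes(raw)       # assumes little-endian PCM16 on the wire
    return samples
```

The resulting sample array can be handed to any audio output that accepts raw PCM at the session's audio_sample_rate.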


Voice Flow: WebSocket

WebSocket voice streams microphone audio over Socket.IO for server-side speech-to-text.

Starting Voice Mode

Client → Server
Event: start_voice
Payload: null

The server responds with voice_started to confirm STT is active.

Streaming Audio

While voice mode is active, send audio chunks:

Client → Server
Event: stream_audio
Payload: { "audio": "<base64-encoded PCM16 at 16kHz>" }

Audio must be PCM 16-bit at 16,000 Hz for STT compatibility.
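Conversely, outgoing microphone samples must be framed and base64-encoded before being sent as stream_audio events. The sketch below uses 20 ms frames (320 samples at 16 kHz) as an example; the protocol does not mandate a specific frame size.

```python
import base64
import array

FRAME_SAMPLES = 320  # 20 ms at 16 kHz; an example choice, not a protocol requirement

def audio_payloads(samples):
    """Yield stream_audio payloads from an iterable of 16-bit samples."""
    buf = array.array("h", samples)
    for i in range(0, len(buf), FRAME_SAMPLES):
        frame = buf[i:i + FRAME_SAMPLES]
        yield {"audio": base64.b64encode(frame.tobytes()).decode("ascii")}
```

Each yielded dict would be emitted as one stream_audio event while voice mode is active.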

Receiving Transcription

Server → Client
Event: stt_response
Payload: { "text": "Hello how are you", "is_final": true }
| Field | Type | Description |
|---|---|---|
| text | string | Transcribed text |
| is_final | boolean | true when the utterance is complete |

Interim results (is_final: false) arrive as the user speaks, providing real-time transcription feedback. When is_final is true, the server automatically triggers the LLM and streams bot_response + bot_voice events.
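A typical client treats interim and final results differently: interim text overwrites a live caption, while the final text is committed to the transcript. A minimal sketch (the state dict shape is an assumption, not part of the protocol):

```python
def handle_stt_response(event, state):
    """Update simple UI state from stt_response events.

    Interim results overwrite the live caption; the final result is
    appended to the transcript and the caption is cleared.
    """
    if event["is_final"]:
        state["transcript"].append(event["text"])
        state["caption"] = ""
    else:
        state["caption"] = event["text"]
```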

Stopping Voice Mode

Client → Server
Event: stop_voice
Payload: null

Voice Flow: LiveKit

LiveKit provides low-latency WebRTC voice with native acoustic echo cancellation (AEC). This is the preferred voice transport.

Token Request

After the Socket.IO session is established, request a LiveKit token:

Client → Server
Event: livekit_token
Payload: null

Token Response

Server → Client
Event: livekit_token
Payload: {
"token": "<JWT>",
"url": "wss://livekit.example.com",
"room": "room_abc123"
}

Joining the Room

  1. Use the LiveKit client SDK to connect to the room using the url and token.
  2. Once connected, notify the server:
     Client → Server
     Event: livekit_join
     Payload: null
  3. The server joins the bot to the same room and confirms:
     Server → Client
     Event: livekit_ready
     Payload: { "room": "room_abc123" }

Once livekit_ready is received, audio flows bidirectionally through WebRTC. The client publishes microphone audio as a LiveKit audio track, and receives bot TTS audio as a remote audio track.
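The handshake above can be tracked with a small state machine, sketched below. The class name is hypothetical, `send` stands in for any function that emits a Socket.IO event, and the actual LiveKit room connection is elided.

```python
class LiveKitHandshake:
    """Track the token -> join -> ready sequence from this section."""

    def __init__(self, send):
        self.send = send        # callable(event_name, payload) -> emits on /sdk
        self.state = "idle"
        self.room = None

    def request_token(self):
        self.send("livekit_token", None)
        self.state = "token_requested"

    def on_token(self, payload):
        # Step 1: connect to payload["url"] with payload["token"] using the
        # LiveKit client SDK (not shown), then notify the server (step 2).
        self.room = payload["room"]
        self.send("livekit_join", None)
        self.state = "joined"

    def on_ready(self, payload):
        # Step 3: server confirms the bot joined the same room.
        assert payload["room"] == self.room
        self.state = "ready"
```

Once state reaches "ready", microphone publishing and bot-audio subscription proceed entirely over WebRTC.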

Leaving the Room

Client → Server
Event: livekit_leave
Payload: null

Errors

Server → Client
Event: livekit_error
Payload: { "error": "Failed to create room" }

Interrupts

Interrupts allow the client to stop an in-progress response (both text generation and TTS).

Sending an Interrupt

Client → Server
Event: client_interrupt
Payload: { "message_id": "msg_abc123" }

The message_id field is optional. If omitted, the server interrupts whatever is currently being generated.

Receiving an Interrupt

Server → Client
Event: interrupt
Payload: {
"message_id": "msg_abc123",
"reason": "user_interrupt",
"interrupted_at": "1706000000.123"
}

When an interrupt is received:

  1. Stop any audio playback for the interrupted message
  2. Clear the current response buffer
  3. Discard any remaining bot_response and bot_voice chunks for the interrupted message_id
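Step 3 is easy to get wrong, since chunks for the interrupted message may still be in flight. One sketch is a filter that remembers interrupted message_ids and rejects their late chunks (class name is illustrative):

```python
class InterruptFilter:
    """Drop bot_response / bot_voice chunks for interrupted messages."""

    def __init__(self):
        self._interrupted = set()

    def on_interrupt(self, event):
        self._interrupted.add(event["message_id"])

    def accept(self, event):
        """Return True if this chunk should still be rendered/played."""
        return event["message_id"] not in self._interrupted
```

Call accept() on every incoming chunk; anything belonging to an interrupted message_id is silently discarded.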

Audio Playback Tracking

Notify the server when TTS audio finishes playing. This helps the server manage conversation timing (e.g., knowing when the character is done speaking before the user responds).

Client → Server
Event: audio_playback_complete
Payload: { "message_id": "msg_abc123" }

Quota Exceeded

If the API key owner has exceeded their monthly interaction quota:

Server → Client
Event: quota_exceeded
Payload: {
"message": "Monthly interaction limit reached",
"current": 1000,
"limit": 1000,
"remaining": 0,
"tier": "free"
}

The connection may be terminated after this event. SDKs should display an appropriate message to the user.


Error Events

General server errors:

Server → Client
Event: error
Payload: { "message": "Internal server error" }

Preferences

Update session-level preferences:

Client → Server
Event: update_preferences
Payload: { "enableVisionAcknowledgment": true }
| Field | Type | Description |
|---|---|---|
| enableVisionAcknowledgment | boolean | When true, the character says an acknowledgment before analyzing a camera image |

Next Steps