Conversation Protocol
All real-time communication between SDKs and the Estuary server happens over Socket.IO v4 on the /sdk namespace. This page documents every event in the protocol.
Transport
The protocol runs over Socket.IO v4, which uses WebSocket as its primary transport. Socket.IO adds:
- Namespace multiplexing (all SDK events use /sdk)
- Automatic reconnection
- Event-based messaging with JSON payloads
- Heartbeat/ping-pong for connection health
Text Flow
Sending Text
The client sends a text event with the message:
Client → Server
Event: text
Payload: { "text": "Hello, how are you?", "textOnly": false }
| Field | Type | Required | Description |
|---|---|---|---|
| text | string | Yes | The message text |
| textOnly | boolean | No | If true, suppress TTS audio in the response. Default: false |
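As a sketch, the payload above can be built with a small helper (Python, illustrative only; how you emit the event depends on your Socket.IO client):

```python
def build_text_payload(text: str, text_only: bool = False) -> dict:
    """Build the payload for the `text` event.

    `textOnly` defaults to false, so this helper only includes it when set.
    (An assumption for brevity; sending "textOnly": false explicitly is
    equally valid per the table above.)
    """
    payload = {"text": text}
    if text_only:
        payload["textOnly"] = True
    return payload
```

For example, `build_text_payload("Hello, how are you?", text_only=True)` produces a message that suppresses TTS audio in the response.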
Receiving Responses
The server streams the response as multiple bot_response events:
Server → Client
Event: bot_response
Payload: {
"text": "I'm doing",
"is_final": false,
"partial": true,
"message_id": "msg_abc123",
"chunk_index": 0,
"is_interjection": false
}
| Field | Type | Description |
|---|---|---|
| text | string | Text content of this chunk |
| is_final | boolean | true when this is the complete response |
| partial | boolean | true for streaming chunks |
| message_id | string | Unique ID for this response (use for interrupt tracking) |
| chunk_index | number | Sequential index of this chunk |
| is_interjection | boolean | true if this is a proactive message (not a reply to user input) |
Chunks arrive in order. Accumulate text from each chunk until is_final is true to build the complete response.
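The accumulation step can be sketched as follows (Python, illustrative; it assumes the final chunk's text is part of the response, and sorts by chunk_index as a cheap safeguard even though chunks arrive in order):

```python
class ResponseAccumulator:
    """Accumulate bot_response chunks into complete messages, keyed by message_id."""

    def __init__(self):
        self._buffers = {}  # message_id -> list of (chunk_index, text)

    def on_bot_response(self, payload: dict):
        """Feed one bot_response payload.

        Returns the full accumulated text once is_final is true, else None.
        """
        buf = self._buffers.setdefault(payload["message_id"], [])
        buf.append((payload["chunk_index"], payload["text"]))
        if payload.get("is_final"):
            full = "".join(text for _, text in sorted(buf))
            del self._buffers[payload["message_id"]]
            return full
        return None
```

Keying by message_id also keeps a reply and a concurrent interjection from being merged into one buffer.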
Voice Response
Unless textOnly was set to true, the server also streams TTS audio:
Server → Client
Event: bot_voice
Payload: {
"audio": "<base64-encoded PCM16>",
"message_id": "msg_abc123",
"chunk_index": 0,
"is_final": false
}
| Field | Type | Description |
|---|---|---|
| audio | string | Base64-encoded PCM 16-bit audio |
| message_id | string | Matches the corresponding bot_response message |
| chunk_index | number | Sequential index of this audio chunk |
| is_final | boolean | true when this is the last audio chunk |
The audio sample rate matches the audio_sample_rate provided during authentication.
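For playback scheduling it can help to know how long a chunk will play. A minimal sketch, assuming mono audio (channel count is not specified by the protocol, so mono is an assumption here):

```python
import base64

def pcm16_duration_seconds(audio_b64: str, sample_rate: int) -> float:
    """Duration of a base64-encoded mono PCM16 chunk.

    PCM16 uses 2 bytes per sample, so duration = bytes / 2 / sample_rate.
    """
    raw = base64.b64decode(audio_b64)
    return len(raw) / 2 / sample_rate
```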
Voice Flow: WebSocket
WebSocket voice streams microphone audio over Socket.IO for server-side speech-to-text.
Starting Voice Mode
Client → Server
Event: start_voice
Payload: null
The server responds with voice_started to confirm STT is active.
Streaming Audio
While voice mode is active, send audio chunks:
Client → Server
Event: stream_audio
Payload: { "audio": "<base64-encoded PCM16 at 16kHz>" }
Audio must be PCM 16-bit at 16,000 Hz for STT compatibility.
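Packing microphone samples into a stream_audio payload might look like this (Python sketch; little-endian byte order is assumed, as is typical for PCM16, but is not stated by the protocol):

```python
import base64
import struct

def build_stream_audio_payload(samples: list) -> dict:
    """Pack 16-bit signed samples (16 kHz mono assumed) into a stream_audio payload."""
    raw = struct.pack("<%dh" % len(samples), *samples)  # little-endian PCM16
    return {"audio": base64.b64encode(raw).decode("ascii")}
```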
Receiving Transcription
Server → Client
Event: stt_response
Payload: { "text": "Hello how are you", "is_final": true }
| Field | Type | Description |
|---|---|---|
| text | string | Transcribed text |
| is_final | boolean | true when the utterance is complete |
Interim results (is_final: false) arrive as the user speaks, providing real-time transcription feedback. When is_final is true, the server automatically triggers the LLM and streams bot_response + bot_voice events.
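One way to fold these events into UI state, sketched in Python (the replace-interim-then-commit-final behavior is an assumption about how a transcript display typically works, not mandated by the protocol):

```python
def apply_stt_response(confirmed: str, payload: dict):
    """Fold one stt_response payload into transcript state.

    Returns (confirmed_text, display_text). Interim results replace the
    provisional tail of the display; a final result is appended to the
    confirmed transcript.
    """
    candidate = (confirmed + " " + payload["text"]).strip()
    if payload["is_final"]:
        return candidate, candidate
    return confirmed, candidate
```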
Stopping Voice Mode
Client → Server
Event: stop_voice
Payload: null
Voice Flow: LiveKit
LiveKit provides low-latency WebRTC voice with native acoustic echo cancellation (AEC). This is the preferred voice transport.
Token Request
After the Socket.IO session is established, request a LiveKit token:
Client → Server
Event: livekit_token
Payload: null
Token Response
Server → Client
Event: livekit_token
Payload: {
"token": "<JWT>",
"url": "wss://livekit.example.com",
"room": "room_abc123"
}
Joining the Room
- Use the LiveKit client SDK to connect to the room using the url and token
- Once connected, notify the server:
Client → Server
Event: livekit_join
Payload: null
- The server joins the bot to the same room and confirms:
Server → Client
Event: livekit_ready
Payload: { "room": "room_abc123" }
Once livekit_ready is received, audio flows bidirectionally through WebRTC. The client publishes microphone audio as a LiveKit audio track, and receives bot TTS audio as a remote audio track.
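The handshake above can be tracked with a small state machine (Python sketch; the state names are illustrative, and actual event emission via your Socket.IO and LiveKit clients is assumed):

```python
class LiveKitHandshake:
    """Track the livekit_token -> livekit_join -> livekit_ready handshake."""

    def __init__(self):
        self.state = "idle"
        self.room = None

    def on_token(self, payload: dict):
        # Server answered our livekit_token request; connect via the LiveKit SDK next.
        self.state = "have_token"
        self.room = payload["room"]

    def on_connected_to_room(self):
        # LiveKit SDK connected; now emit livekit_join over Socket.IO.
        self.state = "joined_client_side"

    def on_ready(self, payload: dict):
        # Server confirmed the bot is in the same room; audio now flows over WebRTC.
        assert payload["room"] == self.room
        self.state = "ready"
```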
Leaving the Room
Client → Server
Event: livekit_leave
Payload: null
Errors
Server → Client
Event: livekit_error
Payload: { "error": "Failed to create room" }
Interrupts
Interrupts allow the client to stop an in-progress response (both text generation and TTS).
Sending an Interrupt
Client → Server
Event: client_interrupt
Payload: { "message_id": "msg_abc123" }
The message_id field is optional. If omitted, the server interrupts whatever is currently being generated.
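A payload builder reflecting the optional field might look like this (sketch; sending an empty object when no message_id is given is an assumption, since the protocol only says the field may be omitted):

```python
def build_interrupt_payload(message_id: str = None) -> dict:
    """Build a client_interrupt payload.

    Omitting message_id asks the server to interrupt whatever is
    currently being generated.
    """
    return {"message_id": message_id} if message_id else {}
```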
Receiving an Interrupt
Server → Client
Event: interrupt
Payload: {
"message_id": "msg_abc123",
"reason": "user_interrupt",
"interrupted_at": "1706000000.123"
}
When an interrupt is received:
- Stop any audio playback for the interrupted message
- Clear the current response buffer
- Discard any remaining bot_response and bot_voice chunks for the interrupted message_id
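The discard step can be sketched as a simple filter (Python, illustrative; stopping playback and clearing buffers is left to the surrounding audio code):

```python
class InterruptFilter:
    """Drop bot_response / bot_voice chunks for interrupted messages."""

    def __init__(self):
        self._interrupted = set()

    def on_interrupt(self, payload: dict):
        self._interrupted.add(payload["message_id"])

    def should_play(self, payload: dict) -> bool:
        """Check an incoming bot_response or bot_voice chunk against interrupts."""
        return payload["message_id"] not in self._interrupted
```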
Audio Playback Tracking
Notify the server when TTS audio finishes playing. This helps the server manage conversation timing (e.g., knowing when the character is done speaking before the user responds).
Client → Server
Event: audio_playback_complete
Payload: { "message_id": "msg_abc123" }
Quota Exceeded
If the API key owner has exceeded their monthly interaction quota:
Server → Client
Event: quota_exceeded
Payload: {
"message": "Monthly interaction limit reached",
"current": 1000,
"limit": 1000,
"remaining": 0,
"tier": "free"
}
The connection may be terminated after this event. SDKs should display an appropriate message to the user.
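A minimal sketch of turning the payload into a user-facing notice (the wording is illustrative, not part of the protocol):

```python
def quota_notice(payload: dict) -> str:
    """Render a user-facing message from a quota_exceeded payload."""
    return (f"{payload['message']} ({payload['current']}/{payload['limit']} "
            f"on the {payload['tier']} tier).")
```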
Error Events
General server errors:
Server → Client
Event: error
Payload: { "message": "Internal server error" }
Preferences
Update session-level preferences:
Client → Server
Event: update_preferences
Payload: { "enableVisionAcknowledgment": true }
| Field | Type | Description |
|---|---|---|
| enableVisionAcknowledgment | boolean | When true, the character says an acknowledgment before analyzing a camera image |
Next Steps
- Memory System -- How conversation memory works
- Action Protocol -- Actions embedded in bot responses