Conversation Protocol
All real-time communication between SDKs and the Estuary server happens over Socket.IO v4 on the /sdk namespace. This page documents every event in the protocol.
Transport
The protocol runs over Socket.IO v4, which uses WebSocket as its primary transport. Socket.IO adds:
- Namespace multiplexing (all SDK events use /sdk)
- Automatic reconnection
- Event-based messaging with JSON payloads
- Heartbeat/ping-pong for connection health
Text Flow
Sending Text
The client sends a text event with the message:
Client → Server
Event: text
Payload: { "text": "Hello, how are you?", "textOnly": false }
| Field | Type | Required | Description |
|---|---|---|---|
| text | string | Yes | The message text |
| textOnly | boolean | No | If true, suppress TTS audio in the response. Default: false |
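As a sketch, the payload above can be built with a small helper (Python, illustrative only; how you emit the event depends on your Socket.IO client):

```python
def build_text_payload(text: str, text_only: bool = False) -> dict:
    """Build the payload for the `text` event.

    `textOnly` defaults to false, so this helper only includes it when set.
    (An assumption for brevity; sending "textOnly": false explicitly is
    equally valid per the table above.)
    """
    payload = {"text": text}
    if text_only:
        payload["textOnly"] = True
    return payload
```

For example, `build_text_payload("Hello, how are you?", text_only=True)` produces a message that suppresses TTS audio in the response.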
Receiving Responses
The server streams the response as multiple bot_response events:
Server → Client
Event: bot_response
Payload: {
"text": "I'm doing",
"is_final": false,
"partial": true,
"message_id": "msg_abc123",
"chunk_index": 0,
"is_interjection": false
}
| Field | Type | Description |
|---|---|---|
| text | string | Text content of this chunk |
| is_final | boolean | true when this is the complete response |
| partial | boolean | true for streaming chunks |
| message_id | string | Unique ID for this response (use for interrupt tracking) |
| chunk_index | number | Sequential index of this chunk |
| is_interjection | boolean | true if this is a proactive message (not a reply to user input) |
Chunks arrive in order. Accumulate text from each chunk until is_final is true to build the complete response.
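The accumulation step can be sketched as follows (Python, illustrative; it assumes the final chunk's text is part of the response, and sorts by chunk_index as a cheap safeguard even though chunks arrive in order):

```python
class ResponseAccumulator:
    """Accumulate bot_response chunks into complete messages, keyed by message_id."""

    def __init__(self):
        self._buffers = {}  # message_id -> list of (chunk_index, text)

    def on_bot_response(self, payload: dict):
        """Feed one bot_response payload.

        Returns the full accumulated text once is_final is true, else None.
        """
        buf = self._buffers.setdefault(payload["message_id"], [])
        buf.append((payload["chunk_index"], payload["text"]))
        if payload.get("is_final"):
            full = "".join(text for _, text in sorted(buf))
            del self._buffers[payload["message_id"]]
            return full
        return None
```

Keying by message_id also keeps a reply and a concurrent interjection from being merged into one buffer.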
Voice Response
Unless textOnly was set to true, the server also streams TTS audio:
Server → Client
Event: bot_voice
Payload: {
"audio": "<base64-encoded PCM16>",
"message_id": "msg_abc123",
"chunk_index": 0,
"is_final": false
}
| Field | Type | Description |
|---|---|---|
| audio | string | Base64-encoded PCM 16-bit audio |
| message_id | string | Matches the corresponding bot_response message |
| chunk_index | number | Sequential index of this audio chunk |
| is_final | boolean | true when this is the last audio chunk |
The audio sample rate matches the audio_sample_rate provided during authentication.
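For playback scheduling it can help to know how long a chunk will play. A minimal sketch, assuming mono audio (channel count is not specified by the protocol, so mono is an assumption here):

```python
import base64

def pcm16_duration_seconds(audio_b64: str, sample_rate: int) -> float:
    """Duration of a base64-encoded mono PCM16 chunk.

    PCM16 uses 2 bytes per sample, so duration = bytes / 2 / sample_rate.
    """
    raw = base64.b64decode(audio_b64)
    return len(raw) / 2 / sample_rate
```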
Voice Flow: WebSocket
WebSocket voice streams microphone audio over Socket.IO for server-side speech-to-text.
Starting Voice Mode
Client → Server
Event: start_voice
Payload: null
The server responds with voice_started to confirm STT is active.
Streaming Audio
While voice mode is active, send audio chunks:
Client → Server
Event: stream_audio
Payload: { "audio": "<base64-encoded PCM16 at 16kHz>" }
Audio must be PCM 16-bit at 16,000 Hz for STT compatibility.
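Packing microphone samples into a stream_audio payload might look like this (Python sketch; little-endian byte order is assumed, as is typical for PCM16, but is not stated by the protocol):

```python
import base64
import struct

def build_stream_audio_payload(samples: list) -> dict:
    """Pack 16-bit signed samples (16 kHz mono assumed) into a stream_audio payload."""
    raw = struct.pack("<%dh" % len(samples), *samples)  # little-endian PCM16
    return {"audio": base64.b64encode(raw).decode("ascii")}
```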
Receiving Transcription
Server → Client
Event: stt_response
Payload: { "text": "Hello how are you", "is_final": true }
| Field | Type | Description |
|---|---|---|
| text | string | Transcribed text |
| is_final | boolean | true when the utterance is complete |
Interim results (is_final: false) arrive as the user speaks, providing real-time transcription feedback. When is_final is true, the server automatically triggers the LLM and streams bot_response + bot_voice events.
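One way to fold these events into UI state, sketched in Python (the replace-interim-then-commit-final behavior is an assumption about how a transcript display typically works, not mandated by the protocol):

```python
def apply_stt_response(confirmed: str, payload: dict):
    """Fold one stt_response payload into transcript state.

    Returns (confirmed_text, display_text). Interim results replace the
    provisional tail of the display; a final result is appended to the
    confirmed transcript.
    """
    candidate = (confirmed + " " + payload["text"]).strip()
    if payload["is_final"]:
        return candidate, candidate
    return confirmed, candidate
```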
Stopping Voice Mode
Client → Server
Event: stop_voice
Payload: null
Voice Flow: LiveKit
LiveKit provides low-latency WebRTC voice with native acoustic echo cancellation (AEC). This is the preferred voice transport.
Token Request
After the Socket.IO session is established, request a LiveKit token:
Client → Server
Event: livekit_token
Payload: null
Token Response
Server → Client
Event: livekit_token
Payload: {
"token": "<JWT>",
"url": "wss://livekit.example.com",
"room": "room_abc123"
}
Joining the Room
- Use the LiveKit client SDK to connect to the room using the url and token
- Once connected, notify the server:
Client → Server
Event: livekit_join
Payload: null
- The server joins the bot to the same room and confirms:
Server → Client
Event: livekit_ready
Payload: { "room": "room_abc123" }
Once livekit_ready is received, audio flows bidirectionally through WebRTC. The client publishes microphone audio as a LiveKit audio track, and receives bot TTS audio as a remote audio track.
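The handshake above can be tracked with a small state machine (Python sketch; the state names are illustrative, and actual event emission via your Socket.IO and LiveKit clients is assumed):

```python
class LiveKitHandshake:
    """Track the livekit_token -> livekit_join -> livekit_ready handshake."""

    def __init__(self):
        self.state = "idle"
        self.room = None

    def on_token(self, payload: dict):
        # Server answered our livekit_token request; connect via the LiveKit SDK next.
        self.state = "have_token"
        self.room = payload["room"]

    def on_connected_to_room(self):
        # LiveKit SDK connected; now emit livekit_join over Socket.IO.
        self.state = "joined_client_side"

    def on_ready(self, payload: dict):
        # Server confirmed the bot is in the same room; audio now flows over WebRTC.
        assert payload["room"] == self.room
        self.state = "ready"
```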
Leaving the Room
Client → Server
Event: livekit_leave
Payload: null
Errors
Server → Client
Event: livekit_error
Payload: { "error": "Failed to create room" }
Interrupts
Interrupts allow the client to stop an in-progress response (both text generation and TTS).
Sending an Interrupt
Client → Server
Event: client_interrupt
Payload: { "message_id": "msg_abc123" }
The message_id field is optional. If omitted, the server interrupts whatever is currently being generated.
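A payload builder reflecting the optional field might look like this (sketch; sending an empty object when no message_id is given is an assumption, since the protocol only says the field may be omitted):

```python
def build_interrupt_payload(message_id: str = None) -> dict:
    """Build a client_interrupt payload.

    Omitting message_id asks the server to interrupt whatever is
    currently being generated.
    """
    return {"message_id": message_id} if message_id else {}
```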
Receiving an Interrupt
Server → Client
Event: interrupt
Payload: {
"message_id": "msg_abc123",
"reason": "user_interrupt",
"interrupted_at": "1706000000.123"
}
When an interrupt is received:
- Stop any audio playback for the interrupted message
- Clear the current response buffer
- Discard any remaining bot_response and bot_voice chunks for the interrupted message_id
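The discard step can be sketched as a simple filter (Python, illustrative; stopping playback and clearing buffers is left to the surrounding audio code):

```python
class InterruptFilter:
    """Drop bot_response / bot_voice chunks for interrupted messages."""

    def __init__(self):
        self._interrupted = set()

    def on_interrupt(self, payload: dict):
        self._interrupted.add(payload["message_id"])

    def should_play(self, payload: dict) -> bool:
        """Check an incoming bot_response or bot_voice chunk against interrupts."""
        return payload["message_id"] not in self._interrupted
```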
Audio Playback Tracking
Notify the server when TTS audio finishes playing. This helps the server manage conversation timing (e.g., knowing when the character is done speaking before the user responds).
Client → Server
Event: audio_playback_complete
Payload: { "message_id": "msg_abc123" }
Quota Exceeded
If the API key owner has exceeded their monthly interaction quota:
Server → Client
Event: quota_exceeded
Payload: {
"message": "Monthly interaction limit reached",
"current": 1000,
"limit": 1000,
"remaining": 0,
"tier": "free"
}
The connection may be terminated after this event. SDKs should display an appropriate message to the user.
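A minimal sketch of turning the payload into a user-facing notice (the wording is illustrative, not part of the protocol):

```python
def quota_notice(payload: dict) -> str:
    """Render a user-facing message from a quota_exceeded payload."""
    return (f"{payload['message']} ({payload['current']}/{payload['limit']} "
            f"on the {payload['tier']} tier).")
```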
Error Events
General server errors:
Server → Client
Event: error
Payload: { "message": "Internal server error" }
Preferences
Update session-level preferences:
Client → Server
Event: update_preferences
Payload: { "enableVisionAcknowledgment": true }
| Field | Type | Description |
|---|---|---|
| enableVisionAcknowledgment | boolean | When true, the character says an acknowledgment before analyzing a camera image |
Next Steps
- Memory System -- How conversation memory works
- Action Protocol -- Actions embedded in bot responses