# Vision & Camera Protocol
Estuary characters can see the user's environment through a camera feed. This page documents the Socket.IO events for camera image capture, scene graph subscriptions, and vision configuration.
All events use the `/sdk` namespace. See the Conversation Protocol for connection setup.
## Camera Image Submission

When the SDK has a camera image to send (either proactively or in response to a server request), it emits a `camera_image` event.

**Client → Server**

Event: `camera_image`

Payload:

```json
{
  "image": "<base64-encoded image>",
  "format": "jpeg",
  "request_id": "req_abc123",
  "text": "What do you see?"
}
```
| Field | Type | Required | Description |
|---|---|---|---|
| `image` | string | Yes | Base64-encoded image data |
| `format` | string | No | Image format: `"jpeg"` (default) or `"png"` |
| `request_id` | string | No | Matches the `request_id` from a `camera_capture` request (if this image was server-initiated) |
| `text` | string | No | Optional user question about the image |
The server analyzes the image with the character's configured VLM and streams a `bot_response` with the character's reaction.
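Assembling this payload is mostly base64 encoding plus dropping unset optional fields. A minimal sketch in Python; the helper name and the `sio` client handle in the comment are illustrative, not part of the SDK:

```python
import base64

def build_camera_image_payload(image_bytes, fmt="jpeg", request_id=None, text=None):
    """Assemble a camera_image payload; optional fields are omitted when unset."""
    payload = {
        "image": base64.b64encode(image_bytes).decode("ascii"),
        "format": fmt,
    }
    if request_id is not None:
        payload["request_id"] = request_id
    if text is not None:
        payload["text"] = text
    return payload

# With a connected python-socketio client `sio` (hypothetical wiring):
# sio.emit("camera_image",
#          build_camera_image_payload(raw_jpeg, text="What do you see?"),
#          namespace="/sdk")
```

Omitting unset fields keeps the payload small and avoids sending explicit nulls for optional keys.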
## Server-Initiated Capture

The server can ask the client to capture and send a camera image. This is triggered by the character's agentic tool system: when the LLM decides it needs to see the user's environment, it invokes a camera capture tool.

**Server → Client**

Event: `camera_capture`

Payload:

```json
{
  "request_id": "req_abc123",
  "text": "Can you show me what's on your desk?"
}
```
| Field | Type | Description |
|---|---|---|
| `request_id` | string | Unique ID for this capture request; include it in the `camera_image` response |
| `text` | string or null | The user's original question that prompted the capture |
### Expected Flow

- Server emits `camera_capture` to the client
- Client captures an image from the device camera
- Client emits `camera_image` with the captured image and the `request_id`
- Server analyzes the image and streams a `bot_response`
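The client side of this flow reduces to echoing the server's `request_id` back with the captured bytes. A sketch under that assumption; the `capture_from_device()` helper in the comment is hypothetical:

```python
import base64

def camera_image_reply(capture_request, image_bytes, fmt="jpeg"):
    """Build the camera_image reply to a server-initiated camera_capture.

    Echoing the request_id lets the server correlate the reply with its request.
    """
    return {
        "image": base64.b64encode(image_bytes).decode("ascii"),
        "format": fmt,
        "request_id": capture_request["request_id"],
    }

# Hypothetical handler wiring with a python-socketio client `sio`:
# @sio.on("camera_capture", namespace="/sdk")
# def on_camera_capture(data):
#     sio.emit("camera_image",
#              camera_image_reply(data, capture_from_device()),
#              namespace="/sdk")
```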
## Scene Graph Subscription

For SDKs with continuous camera feeds (e.g., AR devices), the server builds a real-time scene graph from video frames. SDKs can subscribe to scene graph updates.

### Subscribe

**Client → Server**

Event: `scene_graph_subscribe`

Payload:

```json
{
  "sessionId": "sid_abc123",
  "enableLiveKitVideo": false
}
```
| Field | Type | Description |
|---|---|---|
| `sessionId` | string | The session ID from `session_info` |
| `enableLiveKitVideo` | boolean | Use LiveKit for video frame streaming (lower latency) |
### Receiving Updates

**Server → Client**

Event: `scene_graph_update`

Payload:

```json
{
  "sessionId": "sid_abc123",
  "graph": {
    "objects": [
      { "label": "coffee_mug", "confidence": 0.94, "position": [0.3, 0.5] },
      { "label": "laptop", "confidence": 0.97, "position": [0.5, 0.4] }
    ],
    "surfaces": ["desk"],
    "room": "office"
  },
  "timestamp": "2026-03-01T14:22:00"
}
```
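A `scene_graph_update` handler typically filters detections by confidence before acting on them. A minimal sketch; the helper name and the 0.9 threshold are illustrative choices, not part of the protocol:

```python
def confident_objects(update, threshold=0.9):
    """Return labels of detected objects at or above the confidence threshold."""
    return [obj["label"]
            for obj in update["graph"]["objects"]
            if obj["confidence"] >= threshold]
```

With the example payload above, the default threshold keeps both `coffee_mug` (0.94) and `laptop` (0.97), while a 0.95 threshold keeps only `laptop`.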
### Unsubscribe

**Client → Server**

Event: `scene_graph_unsubscribe`

Payload:

```json
{ "sessionId": "sid_abc123" }
```
## Video Frame Streaming

For continuous vision processing, SDKs send video frames directly:

**Client → Server**

Event: `video_frame`

Payload:

```json
{
  "frame": "<base64-encoded image>",
  "format": "jpeg",
  "timestamp": 1709312520.123
}
```
The server processes frames for world model updates without generating a conversation response for each frame.
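Since every frame costs bandwidth and server-side processing, clients usually cap the outgoing frame rate. A minimal throttle sketch; the class name and the rate in the comment are assumptions, not part of the protocol:

```python
import time

class FrameThrottle:
    """Allow at most max_fps video_frame emissions per second; drop the rest."""

    def __init__(self, max_fps):
        self.min_interval = 1.0 / max_fps
        self.last_sent = float("-inf")  # so the first frame always passes

    def should_send(self, now=None):
        now = time.monotonic() if now is None else now
        if now - self.last_sent >= self.min_interval:
            self.last_sent = now
            return True
        return False

# Hypothetical send loop with a connected client `sio`:
# throttle = FrameThrottle(max_fps=2)
# if throttle.should_send():
#     sio.emit("video_frame",
#              {"frame": b64_frame, "format": "jpeg", "timestamp": time.time()},
#              namespace="/sdk")
```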
## Next Steps
- Conversation Protocol -- Text and voice event reference
- Action Protocol -- Actions triggered from vision context
- Memory System -- How visual context is remembered