Vision & Camera Protocol

Estuary characters can see the user's environment through a camera feed. This page documents the Socket.IO events for camera image capture and vision configuration.

All events use the /sdk namespace. See Conversation Protocol for connection setup.


Camera Image Submission

When the SDK has a camera image to send (either proactively or in response to a server request), it emits a camera_image event.

Client → Server

Event: camera_image

```json
{
  "image": "<base64-encoded image>",
  "mime_type": "image/jpeg",
  "request_id": "req_abc123",
  "text": "What do you see?"
}
```
| Field | Type | Required | Description |
| --- | --- | --- | --- |
| image | string | Yes | Base64-encoded image data |
| mime_type | string | Yes | MIME type of the image, e.g. "image/jpeg" or "image/png" |
| request_id | string | No | Matches the request_id from a camera_capture request (if this image was server-initiated) |
| text | string | No | Optional user question about the image |

The server analyzes the image using the character's configured VLM (vision-language model) and streams a bot_response with the character's reaction.
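The camera_image payload can be assembled from raw image bytes before emitting. A minimal sketch in Python; the helper name build_camera_image_payload is hypothetical, and only the fields documented above are included:

```python
import base64

def build_camera_image_payload(image_bytes, mime_type, request_id=None, text=None):
    """Build a camera_image payload from raw image bytes.

    request_id is set only when echoing a server-initiated
    camera_capture request; text is an optional user question.
    """
    payload = {
        "image": base64.b64encode(image_bytes).decode("ascii"),
        "mime_type": mime_type,
    }
    # Optional fields are omitted entirely rather than sent as null.
    if request_id is not None:
        payload["request_id"] = request_id
    if text is not None:
        payload["text"] = text
    return payload
```

The resulting dict is what the client would pass to its Socket.IO emit call for the camera_image event on the /sdk namespace.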


Server-Initiated Capture

The server can request the client to capture and send a camera image. This is triggered by the character's agentic tool system -- when the LLM decides it needs to see the user's environment, it invokes a camera capture tool.

Server → Client

Event: camera_capture

```json
{
  "request_id": "req_abc123",
  "text": "Can you show me what's on your desk?"
}
```
| Field | Type | Description |
| --- | --- | --- |
| request_id | string | Unique ID for this capture request; include it in the camera_image response |
| text | string \| null | The user's original question that prompted the capture |

Expected Flow

  1. Server emits camera_capture to the client
  2. Client captures an image from the device camera
  3. Client emits camera_image with the captured image and the request_id
  4. Server analyzes the image and streams a bot_response
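The client side of this flow can be sketched as a plain handler, with the Socket.IO wiring omitted. The names handle_camera_capture and capture_fn are hypothetical; capture_fn stands in for whatever device-camera API the SDK uses, and echoing the server's text field back is an assumption rather than a documented requirement:

```python
import base64

def handle_camera_capture(event, capture_fn):
    """Respond to a camera_capture event from the server.

    capture_fn returns raw JPEG bytes from the device camera. The
    returned dict is the camera_image payload the client emits,
    echoing request_id so the server can correlate the response.
    """
    image_bytes = capture_fn()
    return {
        "image": base64.b64encode(image_bytes).decode("ascii"),
        "mime_type": "image/jpeg",
        "request_id": event["request_id"],
        # Assumption: pass the user's original question back with the image.
        "text": event.get("text"),
    }
```

A real client would register this as the camera_capture event handler on the /sdk namespace and emit the returned payload as a camera_image event.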

Next Steps