Vision & Camera Protocol
Estuary characters can see the user's environment through a camera feed. This page documents the Socket.IO events for camera image capture and vision configuration.
All events use the /sdk namespace. See Conversation Protocol for connection setup.
Camera Image Submission
When the SDK has a camera image to send (either proactively or in response to a server request), it emits a camera_image event.
Client → Server
Event: camera_image
Payload: {
"image": "<base64-encoded image>",
"mime_type": "image/jpeg",
"request_id": "req_abc123",
"text": "What do you see?"
}
| Field | Type | Required | Description |
|---|---|---|---|
| image | string | Yes | Base64-encoded image data |
| mime_type | string | Yes | MIME type of the image, e.g. "image/jpeg" or "image/png" |
| request_id | string | No | Matches the request_id from a camera_capture request (if this image was server-initiated) |
| text | string | No | Optional user question about the image |
The server analyzes the image using the character's configured vision-language model (VLM) and streams a bot_response containing the character's reaction.
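As a minimal sketch of the client side of this event, the snippet below builds a camera_image payload from the fields in the table above and emits it. The names `EmitterLike`, `buildCameraImagePayload`, and `sendCameraImage` are hypothetical helpers invented for illustration, not part of the Estuary SDK; the socket is assumed to be already connected to the /sdk namespace.

```typescript
// Payload shape taken from the field table above.
interface CameraImagePayload {
  image: string;        // base64-encoded image data
  mime_type: string;    // e.g. "image/jpeg" or "image/png"
  request_id?: string;  // set only when answering a camera_capture request
  text?: string;        // optional user question about the image
}

// Hypothetical minimal interface (an assumption): any object with a
// Socket.IO-style emit method, already connected to the /sdk namespace.
interface EmitterLike {
  emit(event: string, payload: unknown): unknown;
}

// Hypothetical helper: builds the payload and leaves out optional
// fields that were not provided.
function buildCameraImagePayload(
  imageBase64: string,
  mimeType: string,
  options: { requestId?: string; text?: string } = {},
): CameraImagePayload {
  if (imageBase64.length === 0) {
    throw new Error("image must be a non-empty base64 string");
  }
  const payload: CameraImagePayload = { image: imageBase64, mime_type: mimeType };
  if (options.requestId !== undefined) payload.request_id = options.requestId;
  if (options.text !== undefined) payload.text = options.text;
  return payload;
}

// Emits the documented camera_image event with the given payload.
function sendCameraImage(socket: EmitterLike, payload: CameraImagePayload): void {
  socket.emit("camera_image", payload);
}
```

For a proactive submission, the caller would omit `requestId`; for a server-initiated capture, it would pass the request_id received in the camera_capture event.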
Server-Initiated Capture
The server can ask the client to capture and send a camera image. This is triggered by the character's agentic tool system -- when the LLM decides it needs to see the user's environment, it invokes a camera capture tool.
Server → Client
Event: camera_capture
Payload: {
"request_id": "req_abc123",
"text": "Can you show me what's on your desk?"
}
| Field | Type | Description |
|---|---|---|
| request_id | string | Unique ID for this capture request -- include in the camera_image response |
| text | string or null | The user's original question that prompted the capture |
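A client may want to check an incoming camera_capture payload before acting on it. The type guard below is one possible sketch, based on a strict reading of the table above (text must be present and either a string or null). The name `isCameraCaptureRequest` is hypothetical and not part of the SDK.

```typescript
// Shape of the camera_capture payload, as described in the table above.
interface CameraCaptureRequest {
  request_id: string;
  text: string | null;
}

// Hypothetical runtime check (an assumption, not an SDK function).
// Strict reading: request_id must be a string, text a string or null.
function isCameraCaptureRequest(value: unknown): value is CameraCaptureRequest {
  if (typeof value !== "object" || value === null) return false;
  const candidate = value as Record<string, unknown>;
  const requestId = candidate["request_id"];
  const text = candidate["text"];
  return typeof requestId === "string" && (typeof text === "string" || text === null);
}
```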
Expected Flow
- Server emits camera_capture to the client
- Client captures an image from the device camera
- Client emits camera_image with the captured image and the request_id
- Server analyzes the image and streams a bot_response
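The client half of the flow above could be sketched roughly as follows. Everything here other than the event names and payload fields is an assumption: `SdkSocketLike` is a hypothetical stand-in for a Socket.IO client socket connected to /sdk, and `captureFrame` is a hypothetical, application-supplied function that returns a base64 image from the device camera. This page does not document a failure event, so the sketch only hands capture errors to an optional callback.

```typescript
interface CameraCaptureRequest {
  request_id: string;
  text: string | null;
}

interface CapturedImage {
  image: string;      // base64-encoded image data
  mime_type: string;  // e.g. "image/jpeg"
}

// Hypothetical minimal socket interface (an assumption).
interface SdkSocketLike {
  on(event: string, handler: (payload: CameraCaptureRequest) => void): unknown;
  emit(event: string, payload: unknown): unknown;
}

// Steps 2 and 3 of the flow: capture a frame, then emit camera_image
// carrying the request_id from the capture request.
async function respondToCapture(
  socket: SdkSocketLike,
  request: CameraCaptureRequest,
  captureFrame: () => Promise<CapturedImage>,
): Promise<void> {
  const frame = await captureFrame();
  socket.emit("camera_image", {
    image: frame.image,
    mime_type: frame.mime_type,
    request_id: request.request_id,
  });
}

// Step 1 from the client's point of view: listen for camera_capture.
function registerCaptureHandler(
  socket: SdkSocketLike,
  captureFrame: () => Promise<CapturedImage>,
  onError: (error: unknown) => void = () => {},
): void {
  socket.on("camera_capture", (request) => {
    respondToCapture(socket, request, captureFrame).catch(onError);
  });
}
```

The sketch does not echo the request's text back in camera_image, since the page does not say whether the server expects it; the resulting bot_response is handled by whatever listener the application already uses for conversation events.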
Next Steps
- Conversation Protocol -- Text and voice event reference
- Action Protocol -- Actions triggered from vision context
- Memory System -- How visual context is remembered