# Vision & Camera Protocol
Estuary characters can see the user's environment through a camera feed. This page documents the Socket.IO events for camera image capture, scene graph subscriptions, and vision configuration.
All events use the `/sdk` namespace. See the Conversation Protocol for connection setup.
## Camera Image Submission

When the SDK has a camera image to send (either proactively or in response to a server request), it emits a `camera_image` event.

**Client → Server**

Event: `camera_image`

Payload:

```json
{
  "image": "<base64-encoded image>",
  "format": "jpeg",
  "request_id": "req_abc123",
  "text": "What do you see?"
}
```
| Field | Type | Required | Description |
|---|---|---|---|
| `image` | string | Yes | Base64-encoded image data |
| `format` | string | No | Image format: `"jpeg"` (default) or `"png"` |
| `request_id` | string | No | Matches the `request_id` from a `camera_capture` request (if this image was server-initiated) |
| `text` | string | No | Optional user question about the image |
The server analyzes the image with the character's configured VLM and streams a `bot_response` with the character's reaction.
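Assembling this payload is mostly base64 encoding plus dropping unset optional fields. A minimal sketch in Python; the helper name and the `sio` client handle in the comment are illustrative, not part of the SDK:

```python
import base64

def build_camera_image_payload(image_bytes, fmt="jpeg", request_id=None, text=None):
    """Assemble a camera_image payload; optional fields are omitted when unset."""
    payload = {
        "image": base64.b64encode(image_bytes).decode("ascii"),
        "format": fmt,
    }
    if request_id is not None:
        payload["request_id"] = request_id
    if text is not None:
        payload["text"] = text
    return payload

# With a connected python-socketio client `sio` (hypothetical wiring):
# sio.emit("camera_image",
#          build_camera_image_payload(raw_jpeg, text="What do you see?"),
#          namespace="/sdk")
```

Omitting unset fields keeps the payload small and avoids sending explicit nulls for optional keys.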
## Server-Initiated Capture

The server can ask the client to capture and send a camera image. This is triggered by the character's agentic tool system: when the LLM decides it needs to see the user's environment, it invokes a camera capture tool.

**Server → Client**

Event: `camera_capture`

Payload:

```json
{
  "request_id": "req_abc123",
  "text": "Can you show me what's on your desk?"
}
```
| Field | Type | Description |
|---|---|---|
| `request_id` | string | Unique ID for this capture request; include it in the `camera_image` response |
| `text` | string or null | The user's original question that prompted the capture |
### Expected Flow

- Server emits `camera_capture` to the client
- Client captures an image from the device camera
- Client emits `camera_image` with the captured image and the `request_id`
- Server analyzes the image and streams a `bot_response`
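The client side of this flow reduces to echoing the server's `request_id` back with the captured bytes. A sketch under that assumption; the `capture_from_device()` helper in the comment is hypothetical:

```python
import base64

def camera_image_reply(capture_request, image_bytes, fmt="jpeg"):
    """Build the camera_image reply to a server-initiated camera_capture.

    Echoing the request_id lets the server correlate the reply with its request.
    """
    return {
        "image": base64.b64encode(image_bytes).decode("ascii"),
        "format": fmt,
        "request_id": capture_request["request_id"],
    }

# Hypothetical handler wiring with a python-socketio client `sio`:
# @sio.on("camera_capture", namespace="/sdk")
# def on_camera_capture(data):
#     sio.emit("camera_image",
#              camera_image_reply(data, capture_from_device()),
#              namespace="/sdk")
```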
## Scene Graph Subscription

For SDKs with continuous camera feeds (e.g., AR devices), the server builds a real-time scene graph from video frames. SDKs can subscribe to scene graph updates.

### Subscribe

**Client → Server**

Event: `scene_graph_subscribe`

Payload:

```json
{
  "sessionId": "sid_abc123",
  "enableLiveKitVideo": false
}
```
| Field | Type | Description |
|---|---|---|
| `sessionId` | string | The session ID from `session_info` |
| `enableLiveKitVideo` | boolean | Use LiveKit for video frame streaming (lower latency) |
### Receiving Updates

**Server → Client**

Event: `scene_graph_update`

Payload:

```json
{
  "sessionId": "sid_abc123",
  "graph": {
    "objects": [
      { "label": "coffee_mug", "confidence": 0.94, "position": [0.3, 0.5] },
      { "label": "laptop", "confidence": 0.97, "position": [0.5, 0.4] }
    ],
    "surfaces": ["desk"],
    "room": "office"
  },
  "timestamp": "2026-03-01T14:22:00"
}
```
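A `scene_graph_update` handler typically filters detections by confidence before acting on them. A minimal sketch; the helper name and the 0.9 threshold are illustrative choices, not part of the protocol:

```python
def confident_objects(update, threshold=0.9):
    """Return labels of detected objects at or above the confidence threshold."""
    return [obj["label"]
            for obj in update["graph"]["objects"]
            if obj["confidence"] >= threshold]
```

With the example payload above, the default threshold keeps both `coffee_mug` (0.94) and `laptop` (0.97), while a 0.95 threshold keeps only `laptop`.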
### Unsubscribe

**Client → Server**

Event: `scene_graph_unsubscribe`

Payload:

```json
{ "sessionId": "sid_abc123" }
```
## Video Frame Streaming

For continuous vision processing, SDKs send video frames directly:

**Client → Server**

Event: `video_frame`

Payload:

```json
{
  "frame": "<base64-encoded image>",
  "format": "jpeg",
  "timestamp": 1709312520.123
}
```
The server processes frames for world model updates without generating a conversation response for each frame.
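Since every frame costs bandwidth and server-side processing, clients usually cap the outgoing frame rate. A minimal throttle sketch; the class name and the rate in the comment are assumptions, not part of the protocol:

```python
import time

class FrameThrottle:
    """Allow at most max_fps video_frame emissions per second; drop the rest."""

    def __init__(self, max_fps):
        self.min_interval = 1.0 / max_fps
        self.last_sent = float("-inf")  # so the first frame always passes

    def should_send(self, now=None):
        now = time.monotonic() if now is None else now
        if now - self.last_sent >= self.min_interval:
            self.last_sent = now
            return True
        return False

# Hypothetical send loop with a connected client `sio`:
# throttle = FrameThrottle(max_fps=2)
# if throttle.should_send():
#     sio.emit("video_frame",
#              {"frame": b64_frame, "format": "jpeg", "timestamp": time.time()},
#              namespace="/sdk")
```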
## Next Steps
- Conversation Protocol -- Text and voice event reference
- Action Protocol -- Actions triggered from vision context
- Memory System -- How visual context is remembered