Vision & Camera Protocol

Estuary characters can see the user's environment through a camera feed. This page documents the Socket.IO events for camera image capture, scene graph subscriptions, and vision configuration.

All events use the /sdk namespace. See Conversation Protocol for connection setup.


Camera Image Submission

When the SDK has a camera image to send (either proactively or in response to a server request), it emits a camera_image event.

Client → Server

Event: camera_image
Payload: {
"image": "<base64-encoded image>",
"format": "jpeg",
"request_id": "req_abc123",
"text": "What do you see?"
}
Field       Type    Required  Description
image       string  Yes       Base64-encoded image data
format      string  No        Image format: "jpeg" (default) or "png"
request_id  string  No        Matches the request_id from a camera_capture request (if this image was server-initiated)
text        string  No        Optional user question about the image

The server analyzes the image using the character's configured VLM and streams a bot_response with the character's reaction.
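As a sketch, the payload above can be assembled before emitting it with your Socket.IO client. The helper name and the commented emit wiring are illustrative, not part of the protocol:

```python
import base64

def build_camera_image_payload(image_bytes, fmt="jpeg", request_id=None, text=None):
    # Base64-encode the raw image bytes; optional fields are omitted when
    # unset, matching the "Required: No" fields above.
    payload = {"image": base64.b64encode(image_bytes).decode("ascii"),
               "format": fmt}
    if request_id is not None:
        payload["request_id"] = request_id
    if text is not None:
        payload["text"] = text
    return payload

# With a connected python-socketio client, this would be sent as:
# sio.emit("camera_image", build_camera_image_payload(jpeg_bytes), namespace="/sdk")
```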


Server-Initiated Capture

The server can request the client to capture and send a camera image. This is triggered by the character's agentic tool system -- when the LLM decides it needs to see the user's environment, it invokes a camera capture tool.

Server → Client

Event: camera_capture
Payload: {
"request_id": "req_abc123",
"text": "Can you show me what's on your desk?"
}
Field       Type           Description
request_id  string         Unique ID for this capture request -- include it in the camera_image response
text        string | null  The user's original question that prompted the capture

Expected Flow

  1. Server emits camera_capture to the client
  2. Client captures an image from the device camera
  3. Client emits camera_image with the captured image and the request_id
  4. Server analyzes the image and streams a bot_response
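Steps 2 and 3 of the flow can be sketched as a client-side handler. Here `capture_fn` and `emit_fn` are placeholders for your device's camera capture and your Socket.IO client's emit on the /sdk namespace:

```python
import base64

def handle_camera_capture(payload, capture_fn, emit_fn):
    # Step 2: grab a frame from the device camera (capture_fn returns raw bytes).
    image_bytes = capture_fn()
    # Step 3: echo back the request_id so the server can match this image
    # to its capture request.
    emit_fn("camera_image", {
        "image": base64.b64encode(image_bytes).decode("ascii"),
        "format": "jpeg",
        "request_id": payload["request_id"],
    })
```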

Scene Graph Subscription

For SDKs with continuous camera feeds (e.g., AR devices), the server builds a real-time scene graph from incoming video frames. SDKs can subscribe to scene graph updates.

Subscribe

Client → Server
Event: scene_graph_subscribe
Payload: {
"sessionId": "sid_abc123",
"enableLiveKitVideo": false
}
Field               Type     Description
sessionId           string   The session ID from session_info
enableLiveKitVideo  boolean  Use LiveKit for video frame streaming (lower latency)

Receiving Updates

Server → Client
Event: scene_graph_update
Payload: {
"sessionId": "sid_abc123",
"graph": {
"objects": [
{ "label": "coffee_mug", "confidence": 0.94, "position": [0.3, 0.5] },
{ "label": "laptop", "confidence": 0.97, "position": [0.5, 0.4] }
],
"surfaces": ["desk"],
"room": "office"
},
"timestamp": "2026-03-01T14:22:00"
}
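A receiving handler might, for example, filter the object list by detection confidence. The function name and 0.9 threshold here are illustrative choices, not part of the protocol:

```python
def confident_objects(update, threshold=0.9):
    # Keep only the labels of scene objects detected at or above the
    # confidence threshold.
    return [obj["label"]
            for obj in update["graph"]["objects"]
            if obj["confidence"] >= threshold]
```

For the example payload above, this returns ["coffee_mug", "laptop"].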

Unsubscribe

Client → Server
Event: scene_graph_unsubscribe
Payload: { "sessionId": "sid_abc123" }
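Since every subscribe should eventually be paired with an unsubscribe, one way to keep them matched is a context manager. This is a sketch; `emit_fn` wraps your Socket.IO client's emit on the /sdk namespace:

```python
from contextlib import contextmanager

@contextmanager
def scene_graph_subscription(emit_fn, session_id, enable_livekit_video=False):
    # Subscribe on entry, and always unsubscribe on exit, even after an error.
    emit_fn("scene_graph_subscribe",
            {"sessionId": session_id, "enableLiveKitVideo": enable_livekit_video})
    try:
        yield
    finally:
        emit_fn("scene_graph_unsubscribe", {"sessionId": session_id})
```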

Video Frame Streaming

For continuous vision processing, SDKs send video frames directly:

Client → Server
Event: video_frame
Payload: {
"frame": "<base64-encoded image>",
"format": "jpeg",
"timestamp": 1709312520.123
}

The server processes frames for world model updates without generating a conversation response for each frame.
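Because frames do not each produce a response, clients typically cap the send rate rather than forwarding every camera frame. A minimal throttle sketch; the class name and the 2 fps default are assumptions, not protocol requirements:

```python
import base64
import time

class VideoFrameSender:
    def __init__(self, emit_fn, max_fps=2.0, clock=time.monotonic):
        # emit_fn wraps your Socket.IO emit; clock is injectable for testing.
        self.emit_fn = emit_fn
        self.min_interval = 1.0 / max_fps
        self.clock = clock
        self._last = None

    def send(self, frame_bytes):
        now = self.clock()
        if self._last is not None and now - self._last < self.min_interval:
            return False  # frame dropped: too soon after the previous send
        self._last = now
        self.emit_fn("video_frame", {
            "frame": base64.b64encode(frame_bytes).decode("ascii"),
            "format": "jpeg",
            "timestamp": now,
        })
        return True
```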


Next Steps