- Rokid WebRTC: Android camera streaming, SDP signaling, data channels, ICE, and Python aiortc receiver setup.
- OpenAI Realtime: provider-specific realtime model wiring for backend-augmented vision, image insertion after user audio turns, sideband events, and transcripts.
- Rokid Inputs: Rokid camera constraints and touchpad/debug controls.
Architecture
The common object-detection shape is:
- Android captures a low-rate outward-camera stream.
- Android sends video to a backend vision endpoint over WebRTC.
- The backend receives video, runs detection on the latest useful frame, and normalizes model output.
- The backend publishes app events to Android over a data channel or control WebSocket.
- The backend optionally stores the latest annotated JPEG for inspection, debugging, or realtime model image augmentation.
Android stream
Use a separate camera WebRTC session when detection is not the main realtime media path. Start with the lowest supported capture mode that still supports the detector; on Rokid, use 1024x768 @ 15 fps capture and throttle detection or WebRTC output lower when needed. For glasses apps, freshness and stability usually matter more than visual smoothness.
If the requested mode is not supported by the camera HAL, start capture with a supported mode and let WebRTC adapt the outgoing stream. Create any data channel before the offer if detection events need to move over the same peer connection.
Send explicit app events such as session.start, run.start, debug.step, or workflow.confirm. Queue client events until the channel or control socket is open.
Backend receiver
The backend can be any stack that can receive media and run inference. In Python, use aiortc for WebRTC termination instead of hand-rolled SDP or media parsing. See Rokid WebRTC for the receiver shape.
Keep the receiver thin: accept the media stream, hand frames to a vision processor, and publish normalized app events. Keep session lifecycle, cleanup, and state broadcasting outside the detector model wrapper so the model can be swapped later.
Close peer connections on failed, closed, or disconnected states, and clear channel/session state when the stream ends. If the app supports only one active vision stream, close existing peer connections and clear channel/session state before accepting a new offer so stale detections cannot affect the next run. Prefer H264 when available because it is a practical codec choice for Rokid camera streaming.
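The single-active-stream rule above can be sketched as a small registry. This is a minimal illustration, not the receiver's actual code: VisionSessionRegistry and its method names are hypothetical, and the only thing assumed about a peer connection object is an awaitable close() method, which aiortc's RTCPeerConnection provides.

```python
import asyncio


class VisionSessionRegistry:
    """Tracks peer connections so only one vision stream is active at a time."""

    def __init__(self):
        self._active = set()
        self.generation = 0  # bumped on every accepted offer

    async def replace_with(self, pc):
        """Close every existing connection, then register the new one."""
        stale = list(self._active)
        self._active.clear()
        # return_exceptions=True so one failing close() cannot block the others.
        await asyncio.gather(*(old.close() for old in stale), return_exceptions=True)
        self._active.add(pc)
        self.generation += 1
        return self.generation

    async def on_state_change(self, pc, state):
        """Call from the connection-state callback with the new state string."""
        if state in ("failed", "closed", "disconnected"):
            self._active.discard(pc)
            await pc.close()

    @property
    def active_count(self):
        return len(self._active)
```

Calling replace_with before answering a new offer closes stale connections first, and the returned generation number can be attached to detector results so that results from an earlier stream are easy to discard.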
Frame policy
Object detection should optimize freshness, not throughput. A camera-glasses app that reacts to stale frames feels wrong even when inference is accurate. Use one of these policies:
- Latest-frame buffer: keep only the newest frame while inference runs. This works well when the model is slower than the camera stream.
- Minimum interval: skip frames until now - last_processed >= min_interval_s. This is simple and works well for image augmentation.
- One in-flight inference: if a frame is being processed, drop incoming frames instead of building a queue.
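The three policies can be combined in a few lines. The sketch below is one possible shape, assuming a single-threaded receive loop; FrameGate and LatestFrameBuffer are illustrative names, and the injectable clock parameter exists only to make the gate testable.

```python
import time


class FrameGate:
    """Decides whether a newly arrived frame should be sent to inference."""

    def __init__(self, min_interval_s=0.5, clock=time.monotonic):
        self.min_interval_s = min_interval_s
        self._clock = clock
        self._last_processed = None
        self._in_flight = False

    def try_acquire(self):
        """Return True if the caller may process the current frame."""
        now = self._clock()
        if self._in_flight:
            return False  # one in-flight inference: drop, never queue
        if (self._last_processed is not None
                and now - self._last_processed < self.min_interval_s):
            return False  # minimum interval not reached yet
        self._in_flight = True
        self._last_processed = now
        return True

    def release(self):
        """Call when inference finishes, whether it succeeded or failed."""
        self._in_flight = False


class LatestFrameBuffer:
    """Holds only the newest frame; older unprocessed frames are overwritten."""

    def __init__(self):
        self._frame = None

    def put(self, frame):
        self._frame = frame

    def take(self):
        """Return the newest frame and clear the slot, or None if empty."""
        frame, self._frame = self._frame, None
        return frame
```

A receive loop would call put for every decoded frame, and the inference worker would call try_acquire, take the newest frame, run the detector, and call release in a finally block so a detector exception cannot leave the gate stuck.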
Normalized results
Normalize every detector into a small app-owned structure before any workflow or client code sees it:
- Map provider labels to domain labels on the backend.
- Include confidence and timestamp if downstream logic needs stability checks.
- Make the bounding-box convention explicit. box_xyxy should mean left, top, right, bottom in source-image pixels; use a _norm suffix or a coordinate_space field for normalized boxes.
- Prefer a list of detection objects over parallel labels, boxes, and confidences arrays unless the app already has a strict schema for parallel arrays.
- Keep raw predictions available only in logs or debug traces.
- Use stable event types and field names. Android should ignore unknown fields, but it should not need provider-specific parsing.
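As an illustration of these rules, the sketch below normalizes raw predictions into a list of detection objects. Everything specific here is an assumption made for the example: the LABEL_MAP contents, the timestamp_ms field name, the 0.4 threshold, and the raw prediction shape (a dict with "class", "score", and "bbox" keys), which does not correspond to any particular vendor.

```python
from dataclasses import dataclass, asdict

# Hypothetical provider-label -> domain-label map; replace with the app's own.
LABEL_MAP = {"cup": "mug", "coffee mug": "mug", "screwdriver": "tool"}


@dataclass(frozen=True)
class Detection:
    label: str          # domain label, never the raw provider label
    confidence: float   # 0.0 to 1.0
    box_xyxy: tuple     # (left, top, right, bottom) in source-image pixels
    timestamp_ms: int   # capture or processing time, chosen by the app


def normalize(raw_predictions, timestamp_ms, min_confidence=0.4):
    """Convert raw provider predictions into app-owned Detection objects.

    Assumes each raw prediction is a dict with "class", "score", and "bbox"
    keys, which is an illustrative shape rather than a specific vendor's.
    """
    detections = []
    for pred in raw_predictions:
        label = LABEL_MAP.get(pred["class"])
        if label is None or pred["score"] < min_confidence:
            continue  # unmapped or low-confidence predictions stay in logs only
        left, top, right, bottom = (float(v) for v in pred["bbox"])
        detections.append(
            Detection(label, float(pred["score"]), (left, top, right, bottom), timestamp_ms)
        )
    return detections
```

asdict(detection) gives a plain dict that can go straight into a JSON event, so Android only ever sees the app-owned field names.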
Model backends
Choose the model by the behavior you need:
- Fine-tuned object detector: best for a known set of physical objects, parts, states, or completion markers.
- Open-vocabulary detector: useful during prototyping, but stabilize labels before wiring completion rules to them.
- Hosted detector service: fastest to prototype, but normalize results and hide vendor auth from Android.
Decision logic
Do not let a single detection immediately mutate important user-visible state unless the workflow truly tolerates false positives. Add a confirmation rule between normalized detections and app state. A two-hit rule works well for simple glasses demos. Useful rule types:
- Presence over time: require a class to appear for N frames or M milliseconds.
- Rising edge count: count false-to-true transitions, useful for repeated actions.
- Best-confidence match: choose the highest-confidence object among allowed labels.
- Region rule: require the object box to be inside a known image region.
- Generation match: ignore detector results from an old task generation after the backend switches tasks.
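A two-hit presence rule with rising-edge counting and a generation check might look like the following. This is a sketch under assumptions: PresenceConfirmer is a hypothetical name, "present" is a per-frame boolean the caller derives from normalized detections, and hits are counted in consecutive frames rather than milliseconds.

```python
class PresenceConfirmer:
    """Confirms a label only after it appears in N consecutive frames."""

    def __init__(self, required_hits=2):
        self.required_hits = required_hits
        self.generation = 0
        self.rising_edges = 0
        self._streak = 0
        self._confirmed = False

    def switch_task(self):
        """Start a new task generation and forget earlier evidence."""
        self.generation += 1
        self._streak = 0
        self._confirmed = False
        return self.generation

    def update(self, present, generation):
        """Feed one frame result; return True while the label is confirmed."""
        if generation != self.generation:
            return self._confirmed  # stale result from an old task: ignore
        self._streak = self._streak + 1 if present else 0
        now_confirmed = self._streak >= self.required_hits
        if now_confirmed and not self._confirmed:
            self.rising_edges += 1  # false-to-true transition
        self._confirmed = now_confirmed
        return self._confirmed
```

A single missed frame resets the streak, which is strict; a time-window variant would tolerate brief dropouts at the cost of slower reaction to real changes.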
Event contracts
Use a stable channel or socket contract for vision state. A WebRTC data channel such as vision-events is a good fit when the camera peer connection already exists and events only matter while that stream is live.
Common backend-to-Android events:
- config: detector labels, workflow steps, or task metadata.
- state: normalized status, active task, counters, and latest detection summary.
- detection: optional debug-only detection snapshot.
- Domain events, such as split_completed, task.completed, or workflow.done.
Common Android-to-backend events:
- session.start or run.start.
- debug.step with a direction or target id.
- workflow.confirm for explicit user confirmation.
- session.stop when the app exits.
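One way to keep the contract stable is to send every event as a JSON text message with a type discriminator. The sketch below assumes that wire format; the "type" key, the helper names, and the choice to drop unknown client event types are illustrative decisions, not something the contract above mandates.

```python
import json

# Client event types named in this document; extend as the app grows.
KNOWN_CLIENT_EVENTS = {
    "session.start", "run.start", "debug.step", "workflow.confirm", "session.stop",
}


def encode_event(event_type, **fields):
    """Serialize a backend-to-Android event as one JSON text message."""
    return json.dumps({"type": event_type, **fields}, separators=(",", ":"))


def decode_client_event(message):
    """Parse a client message; return (type, payload) or None if unusable."""
    try:
        data = json.loads(message)
    except (TypeError, ValueError):
        return None  # not JSON at all
    event_type = data.get("type") if isinstance(data, dict) else None
    if not isinstance(event_type, str) or event_type not in KNOWN_CLIENT_EVENTS:
        return None  # unknown or malformed event types are dropped
    # Extra fields pass through untouched so either side can add fields later.
    payload = {k: v for k, v in data.items() if k != "type"}
    return event_type, payload
```

For example, encode_event("state", status="running", count=2) produces a compact message that can be sent over a data channel or a control WebSocket without changing the payload.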
Annotated frames
Annotated frames are useful for debugging, model tuning, and realtime model augmentation. Use supervision, OpenCV, PIL, or the detector library’s own helpers to draw boxes and labels.
Save:
- latest.jpg for quick inspection and downstream image augmentation.
- A bounded timestamped history when debugging regressions.
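A minimal way to save both outputs, assuming the annotated frame has already been encoded to JPEG bytes, is shown below. The history subdirectory, the zero-padded millisecond file names, and the max_history default are choices made for this sketch.

```python
import os
import time
from pathlib import Path


def save_annotated_frame(jpeg_bytes, out_dir, max_history=50, now_ms=None):
    """Write latest.jpg atomically and keep at most max_history history frames."""
    out_dir = Path(out_dir)
    history_dir = out_dir / "history"
    history_dir.mkdir(parents=True, exist_ok=True)

    # Write to a temp file first so readers never see a half-written latest.jpg.
    tmp_path = out_dir / "latest.jpg.tmp"
    tmp_path.write_bytes(jpeg_bytes)
    os.replace(tmp_path, out_dir / "latest.jpg")

    stamp = int(time.time() * 1000) if now_ms is None else now_ms
    (history_dir / f"{stamp:013d}.jpg").write_bytes(jpeg_bytes)

    # Zero-padded names sort chronologically, so the oldest files come first.
    frames = sorted(history_dir.glob("*.jpg"))
    for old in frames[: max(0, len(frames) - max_history)]:
        old.unlink()
    return out_dir / "latest.jpg"
```

The atomic rename matters when another process, such as a realtime model bridge, reads latest.jpg while the detector is still writing.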
Realtime model augmentation
Object detection can provide either structured context or the latest annotated image to a realtime model:
- Use structured text when labels, counts, or state are enough.
- Use an annotated input_image when spatial layout, part appearance, or visual ambiguity matters.
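For the structured-text path, a short summary line is often enough. The sketch below assumes detections arrive as (label, confidence) pairs; the function name, the 0.5 threshold, and the wording of the summary are illustrative, and how the text is delivered to the realtime model is left to the provider wiring.

```python
from collections import Counter


def summarize_detections(detections, min_confidence=0.5):
    """Build a short text context line from (label, confidence) pairs."""
    counts = Counter(label for label, conf in detections if conf >= min_confidence)
    if not counts:
        return "Vision: no confident detections in the latest frame."
    # Sort by label so the same scene always produces the same text.
    parts = [f"{n} x {label}" for label, n in sorted(counts.items())]
    return "Vision: " + ", ".join(parts) + "."
```

Deterministic ordering keeps the context stable between frames, which makes it easier to skip sending an update when nothing has changed.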
Training and tuning
Train and evaluate from the glasses point of view. Record representative Rokid camera footage, including bad lighting, hand occlusion, motion blur, partial objects, and the distances users actually work at. Practical tuning loop:
- Start with a small label set that maps directly to app decisions.
- Capture annotated frame history while using the app.
- Review false positives and missed detections from latest.jpg plus history frames.
- Adjust labels, thresholds, and confirmation rules before changing client or workflow logic.
- Add manual debug controls so the app remains testable when the detector is wrong.