- Rokid WebRTC: media streaming and data channel setup.
- OpenAI Realtime: realtime model turns, backend-controlled speech, and sideband control.
- Object Detection: detector events, confirmation rules, and detection-driven task progression.
Core loop
Perception loop
The perception loop can use any provider that turns live context into observations:
- continuous VLM inference, such as Overshoot
- object detection
- OCR
- barcode or marker detection
- hand, pose, or gesture detection
- periodic image turns to a realtime model
- audio events, sensors, or backend events
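One way to keep the providers listed above interchangeable is to put each one behind a small adapter that returns the same app-owned observation shape. The sketch below is illustrative only: the `Observation` fields, the detector envelope keys (`label`, `score`), and both function names are assumptions, not part of any provider's API.

```python
import re
from dataclasses import dataclass
from typing import Any, Dict, FrozenSet, Optional


@dataclass(frozen=True)
class Observation:
    """Small app-owned event. Field names are hypothetical."""
    kind: str                    # e.g. "object_present"
    value: str                   # normalized label
    confidence: Optional[float]  # 0.0..1.0, or None if the provider gives none
    generation: int              # which perception request produced this
    source: str                  # adapter name, for logging only


def normalize_detector(raw: Dict[str, Any], generation: int) -> Optional[Observation]:
    """Adapter for an assumed detector envelope like {"label": "Mug", "score": 0.91}."""
    label = str(raw.get("label", "")).strip().lower()
    if not label:
        return None  # nothing usable; app logic never sees the raw envelope
    score = max(0.0, min(1.0, float(raw.get("score", 0.0))))
    return Observation("object_present", label, score, generation, "detector")


def normalize_vlm_text(
    text: str, generation: int, known_labels: FrozenSet[str]
) -> Optional[Observation]:
    """Adapter for free-form VLM text: keep only a label the app already knows."""
    tokens = set(re.findall(r"[a-z0-9]+", text.lower()))
    for label in sorted(known_labels):  # sorted for deterministic choice
        if label in tokens:
            return Observation("object_present", label, None, generation, "vlm")
    return None
```

With this shape, downstream logic compares `kind`, `value`, and `generation` and does not need to know which provider produced the event.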
Observation contract
Normalize provider output before app logic sees it. Avoid letting raw VLM text, detector envelopes, transcripts, or provider-specific schemas directly drive behavior. Prefer small app-owned events.
Stabilization
Continuous inference is noisy. Define the trigger policy between observations and app behavior. Useful rules include:
- require the same normalized result for N consecutive observations
- require a condition to hold for M milliseconds
- count rising edges instead of every positive frame
- require confidence or agreement across providers
- require explicit user confirmation for important actions
- ignore observations whose generation, task id, prompt, or detector does not match the active request
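Several of the rules above can be combined in one small trigger. The following is a minimal sketch, assuming observations have already been reduced to a normalized string value plus a generation number; the class name `ConsecutiveTrigger` and its methods are hypothetical. It requires N consecutive matching values, fires once per rising edge, and ignores results from a stale generation.

```python
from typing import Optional


class ConsecutiveTrigger:
    """Fire once when the same value is seen `required` times in a row."""

    def __init__(self, required: int, active_generation: int) -> None:
        self.required = required
        self.active_generation = active_generation
        self._last_value: Optional[str] = None
        self._streak = 0
        self._fired = False

    def reset(self, generation: int) -> None:
        """Call when the active perception request changes."""
        self.active_generation = generation
        self._last_value = None
        self._streak = 0
        self._fired = False

    def feed(self, value: Optional[str], generation: int) -> Optional[str]:
        """Return the value on the rising edge of stability, else None."""
        if generation != self.active_generation:
            return None  # stale result: ignore without disturbing the streak
        if value is None:
            # A negative frame breaks the streak and re-arms the trigger.
            self._last_value, self._streak, self._fired = None, 0, False
            return None
        if value == self._last_value:
            self._streak += 1
        else:
            self._last_value, self._streak, self._fired = value, 1, False
        if self._streak >= self.required and not self._fired:
            self._fired = True  # fire once, not on every positive frame
            return value
        return None
```

A time-based variant (hold for M milliseconds) would track the timestamp of the first matching observation instead of a count; the generation check and the fire-once flag stay the same.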
Workflow authority
Workflow authority means one controller owns the current app state, active perception request, and effect of each observation. For Rokid-class glasses apps, this should usually be the backend. Serialize per-session workflow changes through that controller before mutating state. For each state, define:
- the active perception query
- the observation schema accepted in that state
- the trigger or stabilization rule
- the transition, feedback, or action caused by a valid observation
- how old perception results are invalidated when the state changes
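The per-state checklist above can be sketched as a table of state specs owned by a single controller. This is an illustration under stated assumptions, not a prescribed design: `StateSpec`, `WorkflowController`, the state names, and the `"done"` terminal state are all hypothetical, and the controller is assumed to receive observations that have already passed the stabilization rule. A lock serializes changes for one session, and bumping `generation` on every transition is how older perception results are invalidated.

```python
import threading
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class StateSpec:
    query: str                 # active perception query for this state
    accepted_kind: str         # observation kind accepted in this state
    expected_value: str        # value that satisfies this state
    next_state: Optional[str]  # transition target; None means the workflow ends


class WorkflowController:
    """Single owner of state and active request for one session."""

    def __init__(self, states: Dict[str, StateSpec], initial: str) -> None:
        self._states = states
        self._lock = threading.Lock()
        self.state = initial
        self.generation = 0

    def active_request(self) -> Tuple[int, Optional[str]]:
        """Generation and query that providers should currently be answering."""
        with self._lock:
            spec = self._states.get(self.state)
            return self.generation, (spec.query if spec else None)

    def handle(self, kind: str, value: str, generation: int) -> bool:
        """Apply one stabilized observation; return True if state changed."""
        with self._lock:  # serialize all mutations for this session
            spec = self._states.get(self.state)
            if spec is None:
                return False  # workflow already finished
            if generation != self.generation:
                return False  # result belongs to an earlier request
            if kind != spec.accepted_kind or value != spec.expected_value:
                return False  # not accepted in this state
            self.state = spec.next_state or "done"
            self.generation += 1  # invalidates in-flight results for the old query
            return True
```

Feedback and side effects (speech, display updates) would be issued by the caller after `handle` returns `True`, so that they also flow through the one controller.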