Proactive glasses apps continuously observe real-world context and react when something meaningful changes. They are not just chat flows waiting for user prompts. The main design problem is turning noisy perception into reliable app events. Use this pattern when camera, audio, sensor, or backend observations should guide, alert, adapt, or trigger actions without requiring an explicit command for every step.

Related references:
  • Rokid WebRTC: media streaming and data channel setup.
  • OpenAI Realtime: realtime model turns, backend-controlled speech, and sideband control.
  • Object Detection: detector events, confirmation rules, and detection-driven task progression.

Core loop

camera/audio/sensors
  -> perception loop
  -> normalized observation
  -> stabilization or trigger policy
  -> app workflow/controller
  -> wearer feedback or action
Feedback can be visual display, audio, haptics if available, logs, or backend actions. The proactive pattern does not require any one output channel. Perception providers should report observations; the app controller should own workflow changes and effects.

Perception loop

The perception loop can use any provider that turns live context into observations:
  • continuous VLM inference, such as Overshoot
  • object detection
  • OCR
  • barcode or marker detection
  • hand, pose, or gesture detection
  • periodic image turns to a realtime model
  • audio events, sensors, or backend events
Overshoot is useful for this shape because it can run continuous VLM inference over a live WebRTC stream and return structured results. Treat it as one possible perception provider behind the same observation contract, not as a requirement of the architecture.
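
One way to keep providers interchangeable is a small interface that every provider adapter implements. The sketch below is illustrative only: `Observation`, `PerceptionProvider`, `FakeDetector`, `FakeOcr`, `collect`, and the `"ocr"` source value are hypothetical names, not part of Overshoot, Rokid, or GlassKit. The fake providers stand in for real adapters.

```python
from dataclasses import dataclass, field
from typing import Any, Iterable, Protocol


@dataclass(frozen=True)
class Observation:
    """App-owned observation. Field names are an assumption for this sketch."""
    generation: int
    source: str
    task_id: str
    value: dict[str, Any] = field(default_factory=dict)


class PerceptionProvider(Protocol):
    """Anything that turns live context into zero or more observations."""

    def observe(self, frame: bytes, generation: int, task_id: str) -> Iterable[Observation]:
        ...


class FakeDetector:
    """Hypothetical stand-in for an object detector adapter."""

    def __init__(self, labels: list[str]) -> None:
        self.labels = labels

    def observe(self, frame: bytes, generation: int, task_id: str) -> Iterable[Observation]:
        yield Observation(generation, "vision", task_id, {"visible_items": list(self.labels)})


class FakeOcr:
    """Hypothetical stand-in for an OCR adapter."""

    def __init__(self, text: str) -> None:
        self.text = text

    def observe(self, frame: bytes, generation: int, task_id: str) -> Iterable[Observation]:
        yield Observation(generation, "ocr", task_id, {"text": self.text})


def collect(
    providers: list[PerceptionProvider], frame: bytes, generation: int, task_id: str
) -> list[Observation]:
    """Run every provider on one frame and return their observations in provider order."""
    results: list[Observation] = []
    for provider in providers:
        results.extend(provider.observe(frame, generation, task_id))
    return results
```

Because the app only depends on `PerceptionProvider`, swapping a continuous VLM for a detector should not require changes to workflow code.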

Observation contract

Normalize provider output before app logic sees it. Avoid letting raw VLM text, detector envelopes, transcripts, or provider-specific schemas directly drive behavior. Prefer small app-owned events:
{
  "type": "observation",
  "generation": 4,
  "source": "vision",
  "task_id": "find-ingredient",
  "value": {
    "visible_items": ["lime", "cup"],
    "ready": true
  }
}
The contract should make stale-result checks possible. Include a generation, active task id, prompt id, detector id, or equivalent field when the perception request changes over time. Add a timestamp only when time-based stabilization needs it.
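
A minimal sketch of normalization and a stale-result check, assuming the provider returns JSON text with `visible_items` and `ready` keys. That raw shape, and the function names `normalize` and `is_current`, are assumptions made for illustration; real provider output will differ.

```python
import json
from typing import Any, Optional


def normalize(raw_text: str, generation: int, task_id: str) -> Optional[dict[str, Any]]:
    """Convert raw provider text into the app-owned event, or None if it is unusable."""
    try:
        payload = json.loads(raw_text)
    except json.JSONDecodeError:
        return None  # free-form or truncated model text never reaches app logic
    if not isinstance(payload, dict):
        return None
    items = payload.get("visible_items", [])
    if not isinstance(items, list):
        return None
    return {
        "type": "observation",
        "generation": generation,
        "source": "vision",
        "task_id": task_id,
        "value": {
            # Lowercase labels so later comparisons do not depend on provider casing.
            "visible_items": [str(item).lower() for item in items],
            # Only a literal JSON true counts as ready.
            "ready": payload.get("ready") is True,
        },
    }


def is_current(event: dict[str, Any], active_generation: int, active_task_id: str) -> bool:
    """True when the event belongs to the perception request that is active right now."""
    return event["generation"] == active_generation and event["task_id"] == active_task_id
```

The generation and task id are stamped by the app when the request is issued, not read from the provider, so a late result from an earlier request can be recognized and dropped.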

Stabilization

Continuous inference is noisy. Define the trigger policy between observations and app behavior. Useful rules include:
  • require the same normalized result for N consecutive observations
  • require a condition to hold for M milliseconds
  • count rising edges instead of every positive frame
  • require confidence or agreement across providers
  • require explicit user confirmation for important actions
  • ignore observations whose generation, task id, prompt, or detector does not match the active request
Do not emit user-facing feedback or external actions on every inference callback unless the app explicitly needs a live debug stream.
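
Three of the rules above can be sketched as small stateful gates. The class names are hypothetical, and timestamps are supplied by the caller in milliseconds so the logic stays deterministic and testable.

```python
from typing import Hashable, Optional


class ConsecutiveGate:
    """Fires once when the same normalized value has been seen n times in a row."""

    def __init__(self, n: int) -> None:
        if n < 1:
            raise ValueError("n must be at least 1")
        self.n = n
        self._last: Optional[Hashable] = None
        self._count = 0

    def update(self, value: Hashable) -> bool:
        if self._count and value == self._last:
            self._count += 1
        else:
            self._last = value
            self._count = 1
        # True only on the observation that reaches n, not on every later repeat.
        return self._count == self.n


class HoldGate:
    """Fires once when a condition has held continuously for hold_ms milliseconds."""

    def __init__(self, hold_ms: int) -> None:
        self.hold_ms = hold_ms
        self._since: Optional[int] = None
        self._fired = False

    def update(self, condition: bool, now_ms: int) -> bool:
        if not condition:
            # Any negative observation resets the hold window.
            self._since = None
            self._fired = False
            return False
        if self._since is None:
            self._since = now_ms
        if not self._fired and now_ms - self._since >= self.hold_ms:
            self._fired = True
            return True
        return False


class RisingEdgeCounter:
    """Counts false-to-true transitions instead of every positive frame."""

    def __init__(self) -> None:
        self.count = 0
        self._prev = False

    def update(self, positive: bool) -> int:
        if positive and not self._prev:
            self.count += 1
        self._prev = positive
        return self.count
```

Each gate returns a trigger at most once per stable run, which keeps feedback and external actions off the per-callback path.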

Workflow authority

Workflow authority means one controller owns the current app state, active perception request, and effect of each observation. For Rokid-class glasses apps, this should usually be the backend. Serialize per-session workflow changes through that controller before mutating state. For each state, define:
  • the active perception query
  • the observation schema accepted in that state
  • the trigger or stabilization rule
  • the transition, feedback, or action caused by a valid observation
  • how old perception results are invalidated when the state changes
This prevents raw model responses, client timers, transcripts, and disconnected services from independently advancing the app.
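
The per-state definition and the single-owner rule can be sketched with a state table and one queue-driven loop. Everything here is hypothetical: the state names, queries, feedback strings, and the `StateSpec` and `SessionController` classes are invented for illustration, and stabilization is left out to keep the example short.

```python
import asyncio
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass(frozen=True)
class StateSpec:
    query: str                                  # active perception query for this state
    accepts: Callable[[dict[str, Any]], bool]   # trigger rule applied to the observation value
    next_state: Optional[str]                   # transition caused by a valid observation
    feedback: str                               # feedback emitted on that transition


STATES = {
    "find-ingredient": StateSpec(
        query="Is a lime visible?",
        accepts=lambda value: "lime" in value.get("visible_items", []),
        next_state="cut-ingredient",
        feedback="Lime found. Cut it in half.",
    ),
    "cut-ingredient": StateSpec(
        query="Is the lime cut in half?",
        accepts=lambda value: value.get("ready") is True,
        next_state=None,
        feedback="Done.",
    ),
}


class SessionController:
    """Single owner of state, generation, and effects for one session."""

    def __init__(self, start: str) -> None:
        self.state: Optional[str] = start
        self.generation = 0
        self.feedback: list[str] = []
        self.queue: asyncio.Queue = asyncio.Queue()

    async def run(self) -> None:
        # Every event source enqueues; only this loop mutates state.
        while True:
            event = await self.queue.get()
            if event["type"] == "disconnect":
                return
            if event["type"] == "observation":
                self._on_observation(event)

    def _on_observation(self, event: dict[str, Any]) -> None:
        if self.state is None:
            return  # workflow already finished
        if event["generation"] != self.generation or event["task_id"] != self.state:
            return  # stale result from an earlier perception request
        spec = STATES[self.state]
        if not spec.accepts(event["value"]):
            return
        self.feedback.append(spec.feedback)
        self.state = spec.next_state
        self.generation += 1  # invalidates results still in flight for the old state


def obs(generation: int, task_id: str, value: dict[str, Any]) -> dict[str, Any]:
    return {"type": "observation", "generation": generation,
            "source": "vision", "task_id": task_id, "value": value}


async def demo() -> SessionController:
    controller = SessionController("find-ingredient")
    for event in [
        obs(0, "find-ingredient", {"visible_items": ["cup"]}),   # valid but no trigger
        obs(0, "find-ingredient", {"visible_items": ["lime"]}),  # advances the workflow
        obs(0, "find-ingredient", {"visible_items": ["lime"]}),  # stale after the transition
        obs(1, "cut-ingredient", {"ready": True}),               # finishes the workflow
        {"type": "disconnect"},
    ]:
        controller.queue.put_nowait(event)
    await controller.run()
    return controller


if __name__ == "__main__":
    print(asyncio.run(demo()).feedback)
```

Bumping the generation on every transition is one simple way to invalidate old perception results; a real controller would also reissue the perception query for the new state.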

Example

A concrete example of this pattern is the proactive drink-making coach in GlassKit:
https://github.com/RealComputer/GlassKit/tree/main/examples/rokid-overshoot-openai-realtime
That example uses Overshoot for the continuous VLM loop and OpenAI Realtime for spoken guidance, but the pattern generalizes to other perception providers and workflow domains. Its backend queues perception, OpenAI, control, and disconnect events through one session loop; the Android side streams media, sends gestures, and renders the HUD and transcripts.