- media moves over a WebRTC peer connection
- user inputs and model outputs are stored as conversation items
- voice activity detection turns completed user speech into model output automatically
response.createcan be used for manual backend-controlled turns, such as tool output, injected images, or workflow-authored speech
- tools, transcripts, lifecycle events, and control messages move as JSON events
https://api.openai.com/v1/realtime/calls with a local API key; do not ship that pattern because it embeds the key in the APK.
Use gpt-realtime-2 for new Realtime integrations. Start with reasoning.effort: "low" for responsive speech-to-speech behavior, then raise it only for workflows that need deeper multi-step planning. Use the GA Realtime docs and avoid older beta-era API shapes or model names.
Related reference: Rokid WebRTC covers the Android WebRTC setup, receive-only audio transceivers, SDP normalization, ICE, and lifecycle cleanup that this document assumes.
How it works
Think in two planes:- Media plane: Android’s WebRTC peer connection carries microphone audio, optional camera media, and remote assistant audio.
- Control plane: JSON events move over the
oai-eventsWebRTC data channel and a backend WebSocket sideband channel attached to the same Realtime call.
input_image conversation items and do not currently document live video input over WebRTC. GlassKit has a validated direct-vision path where camera media over WebRTC works; keep input_image as the documented image path for backend augmentation and fallback designs.
The important objects are:
- Session: model, voice, instructions, audio config, tools, turn detection, and output modalities.
- Conversation item: a user message, assistant message, tool call, tool output, image input, or audio item in the live conversation.
- Response: one model turn. In the default path, Realtime creates the response automatically after VAD decides the user has stopped speaking. Use
response.createwhen the app has disabled automatic responses or when the backend adds an item that needs a model answer. - Sideband: a server control channel attached to the call. Use it when backend business logic, tools, workflow state, or guardrails must stay server-side.
Response creation
The default pattern is automatic response creation. Leave VAD enabled and let Realtime decide when the user has finished speaking. Use explicitresponse.create only for backend-gated turns. In that mode, keep VAD enabled for turn detection but set create_response to False, wait for the completed user turn or backend workflow event, add any required conversation items, and then send exactly one response.create.
Pattern matrix
| Pattern | Android media into Realtime | Session input config | Response trigger | Backend sideband role |
|---|---|---|---|---|
| Direct assistant | Mic audio, optional camera video | audio.input.turn_detection.type = semantic_vad; transcription enabled when the client renders user text | Automatic VAD response | Handle tools and optional observability |
| Backend-augmented vision | Mic audio only; camera goes to backend vision service | Same as direct assistant, but create_response = False | Backend waits for the committed user audio item, injects image/context, then sends response.create | Store latest vision result, inject context, handle tools |
| Output-only backend speech | No mic track; receive-only audio transceiver | Usually omit audio.input; configure only output voice | Backend creates text items and sends response.create | Own workflow state, cancel/replace speech, handle tools |
Common patterns
Direct assistant
Use this when the model can own the conversation. Android streams microphone audio and optionally camera media to the Realtime peer connection. Android receives remote assistant audio and renders transcript events fromoai-events.
This is the simplest pattern for a conversational assistant. The backend still brokers SDP and handles tools so secrets and private data stay off the glasses. Keep automatic turn creation enabled for this pattern.
Session shape:
Backend-augmented vision
Use this when you need to add hints for more reliable spatial understanding, object detection, or domain-specific vision. Android runs two links:- audio to OpenAI Realtime, brokered by the backend;
- camera video to a backend vision service, such as object detection models.
input_image or other additional context as an item and then send response.create.
Keep VAD, but disable automatic response creation so the backend has time to inject the image or structured vision result before the model answers:
Backend-controlled speech
Use this for server-authoritative workflows where the backend decides each step, client-visible state, and exact spoken line. Use a receive-only audio transceiver on Android so the SDP offer has an audio section, but do not add a microphone track unless the workflow needs user audio in this Realtime session. For output-only speech, keep the Realtime session small:Speak exactly this line: ... followed by response.create.
System instructions
Realtime instructions should define the assistant’s role, speaking style, visual grounding rules, and tool policy. For smart glasses, keep spoken output short, state what to do next, and explicitly handle unclear audio or poor framing. Put workflow authority in the backend when the app has external state, safety constraints, or deterministic step progression. Example:instructions field from Python, JavaScript, or any other backend that creates the Realtime session.
Backend SDP broker
The backend can be written in any language that can accept SDP and send a multipart request. These are Python/FastAPI snippets. Endpoint contract:call_id:
session.update, call tools, insert conversation items, cancel active speech, and create responses.
Android client contract
Use theoai-events data channel for Realtime event JSON:
- add a local microphone audio track;
- add a camera track only for the direct-vision path you have validated;
- set
OfferToReceiveAudioto"true"so assistant speech plays on the device; - parse
conversation.item.input_audio_transcription.completedfor user text; - parse
response.output_audio_transcript.donefor final assistant text; - parse
response.output_audio_transcript.deltaonly when rendering live captions.
- do not add a local microphone track unless the app needs user audio in this Realtime session;
- add a receive-only audio transceiver so the offer has an
m=audiosection; - require the local SDP to contain
m=audiobefore posting it; - render transcript deltas only for the current speech item;
- clear stale transcript text when backend
speech_epochchanges.
event_id where possible:
Backend-controlled speech
Use this pattern when another backend service owns workflow state:speech_epoch or another speech item version before replacing speech. Android should treat that value as the transcript freshness key.
Track active responses from sideband events:
- set
openai_response_active = Truewhen sendingresponse.createor receivingresponse.created; - set it back to
Falseonresponse.done; - if
response.cancelreturns an error with coderesponse_cancel_not_active, treat it as benign and clear the flag.
Tool loop
Keep tools on the backend. The sideband receives the same Realtime events as the client, including completed function calls. Handle tool calls fromresponse.done, send a function output item, then continue only when the model should keep reasoning or speaking from that output.
Intermediate tools usually continue. Terminal tools should not. For example, a list or lookup tool can continue so the model can use the returned options, but a terminal action tool should usually stop because the backend has already updated workflow state and will speak the next exact line itself.
Image injection
For backend vision augmentation, insert the latest frame after Realtime has committed and added the user’s audio item, then create the response. Use this event sequence for backend-augmented vision:- On
input_audio_buffer.committed, store theitem_idas a pending user turn. - On
conversation.item.added, check that the item id is pending and that the item is a user message containinginput_audio. - Insert the latest
input_imagewithprevious_item_idset to that audio item id. - Send exactly one
response.create.
conversation.item.added; tool outputs, image items, and assistant items can also appear there.