- Android captures camera and/or microphone media.
- Android creates a WebRTC offer.
- Android sends the offer to a backend or service broker.
- The backend returns an answer SDP.
- Android sets the remote description and keeps HUD/control state outside the media track.
Integration shapes
Common patterns:
- Backend media receiver: Android sends an SDP offer to your backend. If the backend is Python, use aiortc to create the peer connection, receive tracks, create an answer, and send app state back on a data channel.
- Backend service broker: Android sends an SDP offer to your backend. The backend creates an upstream vendor stream/call, returns the vendor answer SDP, and relays service events to Android. This fits realtime media APIs.
Use aiortc for Python backend code that terminates WebRTC or needs to generate a local answer SDP. Provider brokers that only forward Android’s offer to an upstream service may not need local WebRTC objects.
Android setup
Use Stream’s WebRTC package. This is the known working version:
Request RECORD_AUDIO only when Android captures local microphone audio. Receive-only remote audio playback and transcript rendering do not need RECORD_AUDIO.
Use android:usesCleartextTraffic="true" only for local http:// development backends.
Create and release these explicitly:
- EglBase
- PeerConnectionFactory
- PeerConnection
- video capturer and SurfaceTextureHelper
- local audio/video sources and tracks
- data channels
- WebSocket or HTTP signaling clients
- JavaAudioDeviceModule if WebRTC owns microphone or speaker routing
Peer connection factory
Initialize WebRTC once per client lifecycle and reuse one PeerConnectionFactory for a session client.
When WebRTC owns audio routing, attach the audio device module to the factory builder with .setAudioDeviceModule(audioDeviceModule). The USAGE_MEDIA route and disabled hardware AEC/NS avoid Rokid vendor VOIP-path issues during simultaneous capture and playback.
Peer connection config
Use Unified Plan. Set OfferToReceiveAudio to "true" only when Android should receive speech or other remote audio. Under Unified Plan, also add a receive-only audio transceiver before creating the offer so the local SDP contains an m=audio section.
Most Rokid vision streams do not receive remote video.
Video capture
Rokid Glasses have a single rear/outward camera. With Stream WebRTC, create a capturer from the available Camera2Enumerator device names.
With CameraX, select CameraSelector.DEFAULT_BACK_CAMERA, request 1024x768 @ 15 fps, and set display rotation so the landscape sensor stream appears correctly in the portrait HUD.
Rokid’s camera HAL does not reliably advertise sub-15 fps modes. Start capture at a supported mode such as 1024x768 @ 15 fps, then use source adaptation to lower the outbound WebRTC rate when needed.
For example, capture at 15 fps and send about 5 fps, using adaptOutputFormat(...) to limit what WebRTC sends.
Avoid letting WebRTC silently lower the video sender quality.
Audio tracks
For WebRTC microphone streaming, request RECORD_AUDIO before starting. If the backend controls when speech plays, Android should receive audio and render transcripts, but the backend should decide exactly what to say and when.
Offer and answer
Create all local tracks and data channels before creating the offer.
Send the offer in one of these shapes:
- Content-Type: application/sdp: request body is raw offer SDP, response body is raw answer SDP.
- Content-Type: application/json: request body contains { "offer_sdp": "..." }, response contains { "answer_sdp": "...", "session_id": "..." }.
Check the answer SDP before calling setRemoteDescription, especially when it came through JSON. A valid SDP starts with v=.
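A backend can apply the same check on its side of the exchange. The following is a minimal Python sketch, not the original implementation: it accepts either request shape, rejects text that does not start with v=, and builds the JSON response. The function names are hypothetical, and normalizing line endings to CRLF is an assumption about what a sensible cleanup looks like, not something the text above prescribes.

```python
import json


def normalize_sdp(raw: str) -> str:
    """Trim surrounding whitespace, require a leading 'v=', and emit CRLF line endings."""
    text = raw.strip()
    if not text.startswith("v="):
        raise ValueError("SDP must start with 'v='")
    # splitlines() accepts both LF and CRLF input; rejoin with CRLF (assumed convention).
    return "\r\n".join(text.splitlines()) + "\r\n"


def parse_offer_request(content_type: str, body: str) -> str:
    """Extract the offer SDP from a raw-SDP or JSON request body."""
    media_type = content_type.split(";")[0].strip().lower()
    if media_type == "application/sdp":
        return normalize_sdp(body)
    if media_type == "application/json":
        payload = json.loads(body)
        offer = payload.get("offer_sdp") if isinstance(payload, dict) else None
        if not isinstance(offer, str):
            raise ValueError("JSON body must contain a string 'offer_sdp'")
        return normalize_sdp(offer)
    raise ValueError(f"unsupported content type: {content_type}")


def build_json_answer(answer_sdp: str, session_id: str) -> str:
    """Serialize the JSON response shape described above."""
    return json.dumps({"answer_sdp": normalize_sdp(answer_sdp), "session_id": session_id})
```

Rejecting a non-SDP string early keeps an error payload or an HTML error page from reaching setRemoteDescription on the device, where the failure is harder to diagnose.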
Data channels
Use a stable label per logical channel, for example vision-events or session-events. Create client-originated channels before the offer.
Give every message a type field. Ignore binary messages unless the app has a specific binary protocol.
Queue client messages until the channel is open, then flush the queue from onStateChange when state becomes DataChannel.State.OPEN. Backend data channel handlers should send initial app state after the channel opens, then broadcast normalized state updates. Android should parse known type values and ignore unknown ones.
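The queue-until-open and ignore-unknown-types rules are the same on either end of the channel. Here is a minimal Python sketch of both, assuming JSON text messages. QueuedChannel, dispatch, and the send callable are hypothetical names for illustration; a real backend would pass its data channel's send method and call on_open from the channel's open event.

```python
import json
from collections import deque
from typing import Any, Callable, Deque, Dict, Union


class QueuedChannel:
    """Buffers outbound messages until the underlying channel reports open."""

    def __init__(self, send: Callable[[str], None]) -> None:
        self._send = send  # hypothetical: the real channel's send function
        self._open = False
        self._pending: Deque[str] = deque()

    def send_event(self, event_type: str, **fields: Any) -> None:
        message = json.dumps({"type": event_type, **fields})
        if self._open:
            self._send(message)
        else:
            self._pending.append(message)

    def on_open(self) -> None:
        """Mark the channel open and flush queued messages in order."""
        self._open = True
        while self._pending:
            self._send(self._pending.popleft())

    def on_close(self) -> None:
        """Mark the channel closed and drop anything still queued."""
        self._open = False
        self._pending.clear()


def dispatch(raw: Union[str, bytes], handlers: Dict[str, Callable[[dict], None]]) -> bool:
    """Route a message by its 'type' field; return False for anything ignored."""
    if not isinstance(raw, str):
        return False  # binary message, no binary protocol defined
    try:
        message = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(message, dict) or not isinstance(message.get("type"), str):
        return False
    handler = handlers.get(message["type"])
    if handler is None:
        return False  # unknown type: ignore rather than fail
    handler(message)
    return True
```

Returning a boolean from dispatch instead of raising lets either side add new message types without breaking older peers.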
ICE servers
For a backend reachable on the same network or a public WebRTC endpoint, STUN is often enough.
Backend receiver pattern
For Python backends that receive media directly, use aiortc. Do not hand-roll SDP parsing or media transport.
Release the session when the connection state becomes failed, closed, or disconnected. For CV inference, consume the latest frame rather than queueing every frame; stale frame queues make HUD state lag behind reality.
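One way to consume only the latest frame is a single-slot mailbox that overwrites any frame the inference loop has not picked up yet. The sketch below uses only asyncio. LatestFrame is a hypothetical helper, not part of aiortc, and it assumes a receive loop calls put() with each frame it reads from the track while a separate inference task awaits get().

```python
import asyncio
from typing import Any, Optional


class LatestFrame:
    """Single-slot mailbox: a new frame replaces any frame not yet consumed."""

    def __init__(self) -> None:
        self._frame: Optional[Any] = None
        self._ready = asyncio.Event()
        self.dropped = 0  # frames overwritten before the consumer saw them

    def put(self, frame: Any) -> None:
        if self._ready.is_set():
            self.dropped += 1
        self._frame = frame
        self._ready.set()

    async def get(self) -> Any:
        """Wait until a frame is available, then take it and empty the slot."""
        await self._ready.wait()
        self._ready.clear()
        frame, self._frame = self._frame, None
        return frame


async def demo() -> tuple:
    box = LatestFrame()  # created inside the running loop
    for index in range(5):
        box.put(index)  # the producer outruns the consumer
    return await box.get(), box.dropped
```

Running asyncio.run(demo()) hands the consumer only the newest of the five frames. The dropped counter is a cheap signal for tuning the outbound frame rate against inference time.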
Backend broker pattern
For hosted media services, the backend usually translates Android’s offer into a provider-specific session.
Lifecycle
A session client should be single-start and idempotent-stop:
- Ignore duplicate start() calls while peerConnection is non-null.
- Stop on explicit user exit and Android onStop().
- Close event WebSockets before disposing the peer connection.
- Tell the backend to close provider streams or media sessions when the app stops.
- Stop and dispose the capturer before disposing SurfaceTextureHelper.
- Dispose tracks and sources before disposing PeerConnectionFactory and EglBase.
- Clear queued data-channel messages on stop.
Map ICE connection states to session state:
- NEW/CHECKING: starting.
- CONNECTED/COMPLETED: live.
- DISCONNECTED/FAILED: connection lost; stop or retry from a clean session.
- CLOSED: stopped.
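The single-start, idempotent-stop contract and the ICE state mapping do not depend on the platform, so they can be sketched compactly in Python. This is an illustration of the control flow only, not Android code. SessionClient, its teardown_steps argument, and the session labels are hypothetical names; on the device, the steps would be the dispose calls in the order listed above.

```python
from typing import Callable, Dict, List

# ICE connection state name -> coarse session state, as listed above.
ICE_TO_SESSION: Dict[str, str] = {
    "NEW": "starting",
    "CHECKING": "starting",
    "CONNECTED": "live",
    "COMPLETED": "live",
    "DISCONNECTED": "lost",
    "FAILED": "lost",
    "CLOSED": "stopped",
}


class SessionClient:
    """Starts at most once per session and tolerates repeated stop() calls."""

    def __init__(self, teardown_steps: List[Callable[[], None]]) -> None:
        self._teardown_steps = teardown_steps  # run in the given order on stop
        self._running = False
        self.pending_messages: List[str] = []

    def start(self) -> bool:
        """Return False, doing nothing, when a session is already running."""
        if self._running:
            return False
        self._running = True
        return True

    def stop(self) -> bool:
        """Run teardown once; later calls return False without side effects."""
        if not self._running:
            return False
        self._running = False
        for step in self._teardown_steps:
            step()
        self.pending_messages.clear()
        return True

    def on_ice_state(self, ice_state: str) -> str:
        """Translate an ICE state; stop on loss so a retry starts from a clean session."""
        session_state = ICE_TO_SESSION.get(ice_state, "starting")
        if session_state == "lost":
            self.stop()
        return session_state
```

Keeping teardown as an ordered list makes the disposal order explicit and easy to check in a unit test.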
Local development
For http:// backends on the development machine, either enable cleartext traffic or expose HTTPS.