Vosk voice commands

Use Vosk for offline, fixed-phrase commands that mirror the Rokid touchpad actions: select, back, next, previous.

Build inputs

// app/build.gradle.kts
android {
    defaultConfig {
        ndk {
            // Limit the packaged native libraries to these ABIs.
            abiFilters += listOf("arm64-v8a", "x86_64")
        }
    }
}

dependencies {
    implementation("com.alphacephei:vosk-android:0.3.75@aar")
    implementation("net.java.dev.jna:jna:5.18.1@aar")
}

Keep the Vosk and JNA dependencies as inline strings: Gradle version catalogs lose the @aar qualifier and can pull duplicate JNA classes. Add android.permission.RECORD_AUDIO to the manifest and request it at runtime before opening AudioRecord. Bundle a model at app/src/main/assets/model-en-us/. Recommended default: vosk-model-small-en-us-0.15.

ASSET_DIR=app/src/main/assets
curl -L -o /tmp/vosk-model-small-en-us-0.15.zip \
  https://alphacephei.com/vosk/models/vosk-model-small-en-us-0.15.zip
unzip -q /tmp/vosk-model-small-en-us-0.15.zip -d /tmp
rm -rf "$ASSET_DIR/model-en-us"
mkdir -p "$ASSET_DIR"
mv /tmp/vosk-model-small-en-us-0.15 "$ASSET_DIR/model-en-us"
printf 'en-us-small-0.15-v1\n' > "$ASSET_DIR/model-en-us/uuid"

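Load the bundled model through Vosk’s Android storage helper. StorageService compares the uuid file in the assets with the one in the unpacked copy and re-copies the model when they differ, so change the uuid string whenever the model files change: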

StorageService.unpack(
    context.applicationContext,
    "model-en-us",
    "model",
    { model -> /* create Recognizer */ },
    { exception -> /* report init failure */ }
)

Check context.assets.list("model-en-us") before unpacking so missing models report a useful runtime error.
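A minimal sketch of that pre-check, kept free of Android types so it can be unit tested. The helper name modelAssetProblem and its messages are hypothetical. It assumes that a missing asset directory shows up as a null or empty listing, and that the uuid file is the one entry worth checking by name.

```kotlin
// Hypothetical helper: returns null when the asset listing looks usable,
// or a message that can be passed to the init-failure path.
fun modelAssetProblem(assetDir: String, entries: Array<String>?): String? {
    // Assumption: a missing asset directory is reported as null or an empty array.
    if (entries == null || entries.isEmpty()) {
        return "Vosk model missing: no assets under $assetDir/"
    }
    // The storage helper needs <assetDir>/uuid to decide whether to re-copy the model.
    if ("uuid" !in entries) {
        return "Vosk model incomplete: $assetDir/uuid not found"
    }
    return null
}

// On Android (not runnable on a plain JVM):
// val problem = modelAssetProblem("model-en-us", context.assets.list("model-en-us"))
```

If the result is non-null, report it and skip StorageService.unpack.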

Recognizer

private const val SAMPLE_RATE_HZ = 16_000

val commands = linkedSetOf("select", "back", "next", "previous")
val grammarJson = JSONArray().apply {
    commands.forEach { put(it) }
    put("[unk]")
}.toString()

val recognizer = Recognizer(model, SAMPLE_RATE_HZ.toFloat(), grammarJson).apply {
    setWords(false)
    setPartialWords(false)
    setEndpointerDelays(5.0f, 0.25f, 3.0f)
}

Normalize both the configured commands and the recognized text with trim().lowercase(Locale.US) so the two sides compare equal. Keep [unk] in the grammar so out-of-grammar speech maps to [unk] instead of being forced onto the nearest command. The endpointer delays above bias recognition toward short utterances: wait up to 5 s for speech to start, finalize 0.25 s after trailing silence, and cap each utterance at 3 s.
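The normalization step can be sketched as two small functions. The names normalizeCommand and normalizeCommands are hypothetical. Dropping blank entries and de-duplicating after normalization are assumptions about how a configured list should be cleaned, not something Vosk requires.

```kotlin
import java.util.Locale

// Same normalization for configured commands and recognized text.
fun normalizeCommand(raw: String): String = raw.trim().lowercase(Locale.US)

// Normalizes a configured list, drops blanks, and keeps first-seen order.
fun normalizeCommands(configured: Iterable<String>): LinkedHashSet<String> =
    configured.map { normalizeCommand(it) }
        .filter { it.isNotEmpty() }
        .toCollection(LinkedHashSet())
```

Build the grammar from the normalized set so that a configured "Select " and a recognized "select" match.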

Audio loop

Feed the recognizer 16 kHz mono PCM16 from a worker thread. Use sample counts, not byte counts, when passing a ShortArray to acceptWaveForm.

val minBufferBytes = AudioRecord.getMinBufferSize(
    SAMPLE_RATE_HZ,
    AudioFormat.CHANNEL_IN_MONO,
    AudioFormat.ENCODING_PCM_16BIT
)
require(minBufferBytes > 0)

val record = AudioRecord(
    MediaRecorder.AudioSource.MIC,
    SAMPLE_RATE_HZ,
    AudioFormat.CHANNEL_IN_MONO,
    AudioFormat.ENCODING_PCM_16BIT,
    maxOf(minBufferBytes, SAMPLE_RATE_HZ * 200 / 1000 * 2)
)

check(record.state == AudioRecord.STATE_INITIALIZED)
record.startRecording()
check(record.recordingState == AudioRecord.RECORDSTATE_RECORDING)

Process.setThreadPriority(Process.THREAD_PRIORITY_AUDIO)
val buffer = ShortArray(SAMPLE_RATE_HZ * 50 / 1000)

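// stopRequested is set from another thread, so declare it @Volatile (or use an atomic).
while (!stopRequested) {
    val readCount = record.read(buffer, 0, buffer.size)
    if (readCount < 0) {
        // Negative values are AudioRecord error codes: report, then leave the loop.
        reportAudioReadFailure(readCount)
        break
    }
    if (readCount == 0) continue

    // readCount is a sample count, which is what acceptWaveForm(short[], int) expects.
    if (recognizer.acceptWaveForm(buffer, readCount)) {
        publishPartial("")
        dispatchResult(recognizer.getResult())
    } else {
        publishPartial(partialText(recognizer.getPartialResult()))
    }
}

// Reached without a stop request only when the loop ended on a read failure:
// flush what the recognizer has buffered. A requested stop discards it.
if (!stopRequested) {
    publishPartial("")
    dispatchResult(recognizer.getFinalResult())
}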
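The sizes above mix two units: the AudioRecord constructor takes a buffer size in bytes, while the ShortArray size and acceptWaveForm take samples. A small sketch of the arithmetic, assuming mono PCM16; the helper names samplesFor and bytesFor are hypothetical.

```kotlin
// One mono PCM16 sample occupies two bytes.
val BYTES_PER_PCM16_SAMPLE = 2

// Number of samples in the given duration.
fun samplesFor(sampleRateHz: Int, millis: Int): Int = sampleRateHz * millis / 1000

// Number of bytes in the given duration.
fun bytesFor(sampleRateHz: Int, millis: Int): Int =
    samplesFor(sampleRateHz, millis) * BYTES_PER_PCM16_SAMPLE
```

At 16 kHz, the 50 ms read buffer holds 800 samples, and the 200 ms recorder floor is 6,400 bytes.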

Parse Vosk JSON with JSONObject: final results use "text" and partial results use "partial".

fun resultText(resultJson: String) = JSONObject(resultJson)
    .optString("text", "")
    .trim()
    .lowercase(Locale.US)

fun partialText(partialJson: String) = JSONObject(partialJson)
    .optString("partial", "")
    .trim()
    .lowercase(Locale.US)

val text = resultText(resultJson)
if (text in commands) {
    onCommand(text)
}

Callbacks from the recognition thread must hop to the main thread before touching Android views.
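One way to keep that hop in a single place is to route every callback through an Executor. This is a sketch with a hypothetical class name, CommandCallbacks; it assumes the app can supply a main-thread Executor, for example from ContextCompat.getMainExecutor(context).

```kotlin
import java.util.concurrent.Executor

// Wraps UI callbacks so they always run on the supplied executor.
class CommandCallbacks(
    private val mainExecutor: Executor,
    private val onCommand: (String) -> Unit,
    private val onPartial: (String) -> Unit,
) {
    // Safe to call from the recognition thread; the callback runs on mainExecutor.
    fun publishCommand(command: String) = mainExecutor.execute { onCommand(command) }

    fun publishPartial(partial: String) = mainExecutor.execute { onPartial(partial) }
}
```

Because the executor is injected, tests can pass a queueing or direct executor in place of the Android main thread.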

Lifecycle

Start only after the model is unpacked and RECORD_AUDIO is granted.
On stop, set the stop flag, stop AudioRecord, interrupt the worker and join it with a short timeout, release AudioRecord, clear partial UI state, and reset any audio meter to zero.
On destroy, close Recognizer and Model.
Call recognizer.reset() before each new listening session.
Suppress duplicate final commands within a short window, around 400 ms, because endpointing can emit the same final result more than once.
Surface actionable errors for missing model, unpack failure, missing permission, invalid buffer size, recorder init/start failure, negative reads, and runtime exceptions.
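The duplicate-suppression rule above can be sketched as a small filter. The class name DuplicateCommandFilter is hypothetical. It assumes that only an identical command inside the window counts as a duplicate, and that the window is measured from the last dispatched command. The clock is injected so the behavior can be tested without sleeping.

```kotlin
// Not thread-safe: call it only from the recognition thread.
class DuplicateCommandFilter(
    private val windowMs: Long = 400,
    private val nowMs: () -> Long = { System.nanoTime() / 1_000_000 },
) {
    private var lastCommand: String? = null
    private var lastAcceptedAtMs = 0L

    // Returns true when the command should be dispatched.
    fun shouldDispatch(command: String): Boolean {
        val now = nowMs()
        val isRepeat = command == lastCommand && now - lastAcceptedAtMs < windowMs
        if (isRepeat) return false
        lastCommand = command
        lastAcceptedAtMs = now
        return true
    }
}
```

Call shouldDispatch(text) just before onCommand(text). A different command inside the window still passes, so "next" followed quickly by "back" is not swallowed.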
