nathanrenting.dev
Project · MVP scaffold

CaptionCompass — see the words, sense where they come from

A caption-first accessibility app for deaf and hard-of-hearing Android users. Live captions are always visible. A rough directional hint of the speaker appears only when it is reliable.

[Figure: hand-drawn audio pipeline. Stereo mics on phone → AAudio UNPROCESSED → VAD → two parallel paths (GCC-PHAT DoA → zone smoother with 5 zones, and ASR via SpeechRecognizer / Vosk) → phone screen with direction arc and captions. Handwritten note: "show less, not more: direction only when reliable".]

Whiteboard sketch · the audio pipeline

The product philosophy in five sentences

The directional hint is opt-in and gated. It appears only when:

  1. The user has turned it on in settings, and
  2. The device exposes two usable microphones via the UNPROCESSED audio source, and
  3. The phone is lying flat and steady (gyroscope check), and
  4. Confidence is high or medium.

If any one of those conditions no longer holds, the captions keep working unchanged and the direction arc fades out. "Show less, not more" does not mean switching a feature off; it means respectfully withdrawing the hint whenever it cannot be backed up firmly enough.
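
The four conditions above can be read as a single boolean gate. The sketch below is a minimal illustration in plain Kotlin, not the project's actual domain model: every name in it (`Confidence`, `DirectionGateInput`, `shouldShowDirection`) is hypothetical, and the sensor and capability checks are assumed to have been reduced to booleans elsewhere.

```kotlin
// Hypothetical names; a sketch of the four-condition gate described above.
enum class Confidence { HIGH, MEDIUM, LOW }

data class DirectionGateInput(
    val enabledInSettings: Boolean,    // condition 1: user opted in
    val hasStereoUnprocessed: Boolean, // condition 2: two usable mics via UNPROCESSED
    val flatAndSteady: Boolean,        // condition 3: phone lying flat and steady
    val confidence: Confidence,        // condition 4: must be HIGH or MEDIUM
)

/** True only when every condition holds. Captions are never gated by this function. */
fun shouldShowDirection(input: DirectionGateInput): Boolean =
    input.enabledInSettings &&
        input.hasStereoUnprocessed &&
        input.flatAndSteady &&
        input.confidence != Confidence.LOW

fun main() {
    val ok = DirectionGateInput(true, true, true, Confidence.MEDIUM)
    println(shouldShowDirection(ok))                             // true
    println(shouldShowDirection(ok.copy(flatAndSteady = false))) // false
}
```

Keeping the gate a pure function of its inputs is one way to make a fallback policy easy to unit-test, since no Android classes are involved.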

What's in the MVP

Layer | Status
Project / Gradle / Manifest | ready
Domain model + fallback policy | ready, unit-tested
Capability probe (looks for UNPROCESSED + stereo) | ready
AudioRecord stereo capture | ready
Energy VAD (voice activity detection) | ready
GCC-PHAT DoA + zone smoother | ready, unit-tested on τ→zone mapping
Android SpeechRecognizer adapter | ready
Mock audio + mock ASR (for emulator work) | ready
Foreground service (Android 14 mic type) | ready
Compose UI (status bar, direction arc, captions, controls) | ready
Vosk fallback ASR, Silero-VAD ONNX | Phase 2
Stereo BLE input (external stereo mics) | Phase 4

Stack

Language: Kotlin 2.0.20
UI: Jetpack Compose (BOM 2024.10), single-screen single-activity
Audio capture: AudioRecord with UNPROCESSED source and stereo channel mask
VAD: Energy-based (Phase 1), Silero-VAD ONNX (Phase 2)
DoA: GCC-PHAT (generalized cross-correlation phase transform), custom Kotlin implementation
ASR: Android SpeechRecognizer (online), Vosk (Phase 2, offline fallback)
Service: Foreground service with Android 14 microphone foreground type
Min SDK: 26 (Android 8.0+)
Target SDK: 35 (Android 15)

Why GCC-PHAT and not an ML model for direction

Direction-of-arrival (DoA) estimation with two microphones is a solved mathematical problem if you frame it correctly. GCC-PHAT yields a time difference of arrival (TDOA), from which a single azimuth follows. With roughly 14 cm between the phone's mics this does not give degree-level precision, but it does give a reliable five-zone indication: left, centre, right, rear-left, rear-right. That is enough for the use case, without ML models that cost latency and battery.
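
To make the TDOA step concrete, here is a minimal GCC-PHAT sketch in plain Kotlin. It is not the project's implementation: the function names are hypothetical, it uses a naive O(n²) DFT instead of an FFT for readability, and the 48 kHz sample rate, 343 m/s speed of sound and far-field model (τ = d·sin θ / c) are assumptions. The sign convention is also a choice made for this sketch: a positive lag means the right channel is delayed, so the source sits towards the left mic.

```kotlin
import kotlin.math.PI
import kotlin.math.asin
import kotlin.math.cos
import kotlin.math.hypot
import kotlin.math.sin
import kotlin.random.Random

/** Lag in samples (within ±maxLag) by which `right` is delayed relative to `left`. */
fun gccPhatLag(left: DoubleArray, right: DoubleArray, maxLag: Int): Int {
    require(left.size == right.size) { "channels must have equal length" }
    val n = left.size
    require(maxLag in 0 until n / 2) { "maxLag must be below n/2" }
    val gRe = DoubleArray(n)
    val gIm = DoubleArray(n)
    for (k in 0 until n) {
        var lRe = 0.0; var lIm = 0.0; var rRe = 0.0; var rIm = 0.0
        for (t in 0 until n) { // naive DFT of both channels at bin k
            val a = -2.0 * PI * k * t / n
            lRe += left[t] * cos(a); lIm += left[t] * sin(a)
            rRe += right[t] * cos(a); rIm += right[t] * sin(a)
        }
        // Cross-spectrum conj(L) * R, then PHAT weighting: keep the phase, drop the magnitude.
        val re = lRe * rRe + lIm * rIm
        val im = lRe * rIm - lIm * rRe
        val mag = hypot(re, im)
        if (mag > 1e-12) { gRe[k] = re / mag; gIm[k] = im / mag }
    }
    var bestLag = 0
    var best = Double.NEGATIVE_INFINITY
    for (lag in -maxLag..maxLag) { // inverse DFT, evaluated only at the lags of interest
        var acc = 0.0
        for (k in 0 until n) {
            val a = 2.0 * PI * k * lag / n
            acc += gRe[k] * cos(a) - gIm[k] * sin(a)
        }
        if (acc > best) { best = acc; bestLag = lag }
    }
    return bestLag
}

/** Far-field conversion from lag to an angle off broadside, in degrees (assumed model). */
fun lagToAzimuthDeg(
    lagSamples: Int,
    sampleRateHz: Double,
    micSpacingM: Double,
    speedOfSoundMps: Double = 343.0,
): Double {
    val tau = lagSamples / sampleRateHz
    val s = (tau * speedOfSoundMps / micSpacingM).coerceIn(-1.0, 1.0)
    return asin(s) * 180.0 / PI
}

fun main() {
    val rng = Random(42)
    val n = 64
    val left = DoubleArray(n) { rng.nextDouble(-1.0, 1.0) }
    // Circular 3-sample delay keeps this toy example exact.
    val right = DoubleArray(n) { left[(it - 3 + n) % n] }
    val lag = gccPhatLag(left, right, maxLag = 20)
    println("lag = $lag samples") // lag = 3 samples
    println("azimuth = ${lagToAzimuthDeg(lag, 48_000.0, 0.14)} degrees")
}
```

Under these assumptions the largest physically possible lag is about 0.14 m / 343 m/s × 48 000 Hz ≈ 19.6 samples, which is why only a handful of distinct lags exist and coarse zones are a better fit than degrees. Mapping the angle onto the five zones is left out here because the source does not describe how that mapping is done.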

Confidence is derived from the sharpness of the cross-correlation peak. A sharp, high peak indicates one clear source; a flat peak indicates multiple speakers or noise, in which case the arc fades out rather than showing a misleading direction.
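
One simple way to turn peak sharpness into a confidence level is a peak-to-mean ratio over the correlation values. The sketch below is an assumption about how this could look, not the project's actual metric: the names (`peakSharpness`, `classify`) and the thresholds 6.0 and 3.0 are hypothetical placeholders that would need tuning on real recordings.

```kotlin
import kotlin.math.abs

enum class Confidence { HIGH, MEDIUM, LOW }

/** Peak-to-mean ratio of |correlation|: large for one sharp peak, close to 1 for a flat curve. */
fun peakSharpness(correlation: DoubleArray): Double {
    require(correlation.isNotEmpty()) { "correlation must not be empty" }
    val mags = correlation.map { abs(it) }
    val mean = mags.average()
    val peak = mags.maxOrNull() ?: 0.0
    return if (mean == 0.0) 0.0 else peak / mean
}

/** Hypothetical thresholds; tune on real data before relying on them. */
fun classify(sharpness: Double, high: Double = 6.0, medium: Double = 3.0): Confidence = when {
    sharpness >= high -> Confidence.HIGH
    sharpness >= medium -> Confidence.MEDIUM
    else -> Confidence.LOW
}

fun main() {
    // One dominant peak among 41 candidate lags.
    val sharp = DoubleArray(41) { if (it == 23) 1.0 else 0.02 }
    // Nearly flat curve, as with several speakers or noise.
    val flat = DoubleArray(41) { 0.3 + 0.05 * (it % 3) }
    println(classify(peakSharpness(sharp))) // HIGH
    println(classify(peakSharpness(flat)))  // LOW
}
```

A LOW result here corresponds to the case where the arc fades out and only the captions remain.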

Status + roadmap

Now: MVP scaffold ready. Domain layer unit-tested, Compose UI runs, stereo capture works, captions work via SpeechRecognizer. Mock mode for emulator development.

Phase 2: Vosk offline ASR + Silero-VAD ONNX (better VAD, device-independent).

Phase 4: Stereo BLE input — connect external stereo mics over Bluetooth for better DoA precision than the built-in phone mics.