CaptionCompass — see the words, sense where they come from
A caption-first accessibility app for deaf and hard-of-hearing Android users. Live captions are always visible. A rough hint of the speaker's direction appears only when it is reliable.

Figure: whiteboard sketch of the audio pipeline.
The product philosophy in five sentences
The directional hint is opt-in and gated. It appears only when:
- The user has turned it on in settings, and
- The device exposes two usable microphones via the UNPROCESSED audio source, and
- The phone is lying flat and steady (gyroscope check), and
- Confidence is high or medium.
If any one of those conditions stops holding, the captions keep working unchanged and the direction arc fades out. The principle is "show less, not more": the feature is not switched off, it is respectfully withdrawn whenever the direction cannot be backed up firmly enough.
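The gating rule above can be sketched as a single pure function. This is a minimal illustration in Python rather than the app's own code; the names `HintInputs`, `Confidence`, and `show_direction_arc` are hypothetical and do not come from the project.

```python
from dataclasses import dataclass
from enum import Enum


class Confidence(Enum):
    LOW = 0
    MEDIUM = 1
    HIGH = 2


@dataclass(frozen=True)
class HintInputs:
    # All field names are illustrative assumptions, not the project's model.
    enabled_in_settings: bool      # the user opted in
    has_unprocessed_stereo: bool   # result of the capability probe
    flat_and_steady: bool          # result of the motion-sensor check
    confidence: Confidence         # DoA confidence for the current frame


def show_direction_arc(inputs: HintInputs) -> bool:
    """Return True only when every gate is open.

    Captions are deliberately not an input or output here: they keep
    running regardless of what this function returns.
    """
    return (
        inputs.enabled_in_settings
        and inputs.has_unprocessed_stereo
        and inputs.flat_and_steady
        and inputs.confidence in (Confidence.HIGH, Confidence.MEDIUM)
    )
```

Keeping the decision in one side-effect-free function makes the fallback policy easy to unit-test: flipping any single input to its failing value should hide the arc.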
What's in the MVP
| Layer | Status |
|---|---|
| Project / Gradle / Manifest | ready |
| Domain model + fallback policy | ready, unit-tested |
| Capability probe (looks for UNPROCESSED + stereo) | ready |
| AudioRecord stereo capture | ready |
| Energy VAD (voice activity detection) | ready |
| GCC-PHAT DoA + zone smoother | ready, unit-tested on τ→zone mapping |
| Android SpeechRecognizer adapter | ready |
| Mock audio + mock ASR (for emulator work) | ready |
| Foreground service (Android 14 mic type) | ready |
| Compose UI (status bar, direction arc, captions, controls) | ready |
| Vosk fallback ASR, Silero-VAD ONNX | planned, Phase 2 |
| Stereo BLE input (external stereo mics) | planned, Phase 4 |
Stack
- AudioRecord with UNPROCESSED source and stereo channel mask
- SpeechRecognizer (online), Vosk (Phase 2, offline fallback)
- Foreground service with the microphone foreground type
Why GCC-PHAT and not an ML model for direction
Direction of arrival (DoA) with two microphones is a well-understood mathematical problem when it is framed correctly. GCC-PHAT (generalised cross-correlation with phase transform) yields a time difference of arrival, from which an azimuth estimate follows. On a phone with ~14 cm between the mics this does not give degree-level precision, but it does give a reliable left / centre / right / rear-left / rear-right indication (five zones). That is enough for the use case, without ML models that cost latency and battery.
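To make the precision argument concrete, here is a small sketch of the standard far-field relation sin θ = c·τ / d between delay and angle. The 48 kHz sample rate and the 343 m/s speed of sound are assumptions, not values from this README, and the function names are hypothetical. The sketch only covers the angle relative to the microphone axis; how rear zones are separated is not described here and is not modelled.

```python
import math

SPEED_OF_SOUND_M_S = 343.0   # assumption: air at roughly 20 °C
MIC_SPACING_M = 0.14         # the ~14 cm spacing mentioned in the text
SAMPLE_RATE_HZ = 48_000      # assumption: capture rate is not stated


def max_delay_samples(spacing_m=MIC_SPACING_M, rate_hz=SAMPLE_RATE_HZ):
    """Largest physically possible inter-mic delay, expressed in samples."""
    return spacing_m / SPEED_OF_SOUND_M_S * rate_hz


def angle_from_delay(tau_s, spacing_m=MIC_SPACING_M):
    """Angle in degrees off broadside (0 = source equidistant from both mics)."""
    ratio = tau_s * SPEED_OF_SOUND_M_S / spacing_m
    ratio = max(-1.0, min(1.0, ratio))   # clamp noise-induced overshoot
    return math.degrees(math.asin(ratio))
```

Under these assumptions the whole angular range spans only about ±19.6 samples of delay, and a single sample of delay near broadside already corresponds to roughly 3°, with coarser steps towards the ends of the axis. That is why coarse zones are a better fit than a degree readout.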
Confidence is derived from the sharpness of the cross-correlation peak: a sharp, high peak means one clear source, while a flat peak means multiple speakers or noise. In the latter case the arc fades out instead of showing a misleading direction.
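The following is a minimal, dependency-free sketch of GCC-PHAT with a peak-sharpness score, written in Python for readability rather than taken from the app. The sharpness measure used here (peak divided by the mean absolute value of the other lags in the search window) is one possible choice and an assumption; the README does not say which measure or thresholds the app uses. The helper names `fft` and `gcc_phat` are hypothetical.

```python
import cmath
import math


def fft(x, inverse=False):
    """Recursive radix-2 FFT, unnormalised; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def gcc_phat(left, right, max_lag):
    """Return (lag, sharpness) for two equal-length frames.

    lag > 0 means `right` is delayed relative to `left`, in samples.
    max_lag must be at least 1 and smaller than the frame length.
    """
    n = 1
    while n < 2 * len(left):
        n *= 2   # zero-pad so the circular correlation does not wrap
    spec_l = fft(list(left) + [0.0] * (n - len(left)))
    spec_r = fft(list(right) + [0.0] * (n - len(right)))
    cross = [r * l.conjugate() for l, r in zip(spec_l, spec_r)]
    # PHAT weighting: discard magnitude, keep only the phase of each bin.
    white = [c / (abs(c) + 1e-12) for c in cross]
    cc = [v.real / n for v in fft(white, inverse=True)]
    lags = range(-max_lag, max_lag + 1)
    values = [cc[lag % n] for lag in lags]   # negative lags wrap to the end
    peak_i = max(range(len(values)), key=lambda i: values[i])
    rest = [abs(v) for i, v in enumerate(values) if i != peak_i]
    sharpness = values[peak_i] / (sum(rest) / len(rest) + 1e-12)
    return lags[peak_i], sharpness
```

With a ~14 cm spacing, `max_lag` only needs to cover the physically possible delays, which keeps the peak search small. Mapping `sharpness` to high, medium, or low confidence would need thresholds tuned on real devices.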
Status + roadmap
Now: MVP scaffold ready. Domain layer unit-tested, Compose UI runs, stereo capture works, captions work via SpeechRecognizer. Mock mode for emulator development.
Phase 2: Vosk offline ASR + Silero-VAD ONNX (better VAD, device-independent).
Phase 4: Stereo BLE input — connect external stereo mics over Bluetooth for better DoA precision than the built-in phone mics.