Project · MVP scaffold

CaptionCompass — see the words, sense where they come from

A caption-first accessibility app for deaf and hard-of-hearing Android users. Live captions are always visible. A rough hint of the speaker's direction appears only when it is reliable.

Figure (whiteboard sketch): the audio pipeline. Stereo mics on phone → AAudio UNPROCESSED → VAD → two parallel paths (GCC-PHAT DoA → zone smoother with 5 zones; ASR via SpeechRecognizer / Vosk) → phone screen with direction arc and captions. Note on the sketch: 'show less, not more: direction only when reliable'.

The product philosophy in five sentences

The directional hint is opt-in and gated. It appears only when:

The user has turned it on in settings, and
The device exposes two usable microphones via the UNPROCESSED audio source, and
The phone is lying flat and steady (gyroscope check), and
Confidence is high or medium.

If any one of those conditions stops holding, the captions keep working unchanged and the direction arc fades out. "Show less, not more" does not mean switching a feature off; it means respectfully withdrawing the hint whenever it cannot be backed up firmly enough.
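The four gates above can be sketched as a single pure function. This is a minimal illustration, not the project's actual domain model: the names `Confidence`, `DirectionInputs` and `shouldShowDirection` are hypothetical, and the existence of a `LOW` level is an assumption inferred from "high or medium".

```kotlin
// Hypothetical names throughout; the project's real domain model is not shown in this document.
enum class Confidence { HIGH, MEDIUM, LOW } // LOW is an assumed third level

data class DirectionInputs(
    val enabledInSettings: Boolean,    // user opted in
    val hasStereoUnprocessed: Boolean, // capability probe result
    val flatAndSteady: Boolean,        // motion-sensor check
    val confidence: Confidence,        // DoA confidence
)

/** True only when every gate holds. Captions never depend on this result. */
fun shouldShowDirection(i: DirectionInputs): Boolean =
    i.enabledInSettings &&
        i.hasStereoUnprocessed &&
        i.flatAndSteady &&
        i.confidence != Confidence.LOW

fun main() {
    val ok = DirectionInputs(true, true, true, Confidence.MEDIUM)
    println(shouldShowDirection(ok))                                   // true
    println(shouldShowDirection(ok.copy(flatAndSteady = false)))       // false
    println(shouldShowDirection(ok.copy(confidence = Confidence.LOW))) // false
}
```

Keeping the decision in one side-effect-free function makes the fallback policy easy to unit-test without any audio or UI code.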

What's in the MVP

Layer	Status
Project / Gradle / Manifest	ready
Domain model + fallback policy	ready, unit-tested
Capability probe (looks for UNPROCESSED + stereo)	ready
`AudioRecord` stereo capture	ready
Energy VAD (voice activity detection)	ready
GCC-PHAT DoA + zone smoother	ready, unit-tested on τ→zone mapping
Android `SpeechRecognizer` adapter	ready
Mock audio + mock ASR (for emulator work)	ready
Foreground service (Android 14 mic type)	ready
Compose UI (status bar, direction arc, captions, controls)	ready
Vosk fallback ASR, Silero-VAD ONNX	Phase 2
Stereo BLE input (external stereo mics)	Phase 4

Stack

Language	Kotlin 2.0.20
UI	Jetpack Compose (BOM 2024.10), single-screen single-activity
Audio capture	AudioRecord with UNPROCESSED source and stereo channel mask
VAD	Energy-based (Phase 1), Silero-VAD ONNX (Phase 2)
DoA	GCC-PHAT (generalized cross-correlation phase transform), custom Kotlin implementation
ASR	Android SpeechRecognizer (online), Vosk (Phase 2, offline fallback)
Service	Foreground service with Android 14 microphone foreground type
Min SDK	26 (Android 8.0+)
Target SDK	35 (Android 15)
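For orientation, the Phase 1 energy-based VAD in the stack could look roughly like the sketch below. It is an illustration under assumptions, not the project's code: the names `rmsDbfs` and `EnergyVad`, the -45 dBFS threshold and the hangover length are all hypothetical and would need tuning per device.

```kotlin
import kotlin.math.log10
import kotlin.math.sqrt

/** RMS level of one 16-bit PCM frame in dBFS (0 dBFS = full scale). */
fun rmsDbfs(frame: ShortArray): Double {
    if (frame.isEmpty()) return Double.NEGATIVE_INFINITY
    var sumSq = 0.0
    for (s in frame) {
        val x = s / 32768.0 // normalise to [-1, 1)
        sumSq += x * x
    }
    val rms = sqrt(sumSq / frame.size)
    return if (rms == 0.0) Double.NEGATIVE_INFINITY else 20.0 * log10(rms)
}

/** Threshold VAD with a hangover so short pauses do not cut speech off. */
class EnergyVad(
    private val thresholdDbfs: Double = -45.0, // assumed value
    private val hangoverFrames: Int = 8,       // assumed value
) {
    private var framesSinceSpeech = Int.MAX_VALUE

    fun isSpeech(frame: ShortArray): Boolean {
        if (rmsDbfs(frame) >= thresholdDbfs) {
            framesSinceSpeech = 0
        } else if (framesSinceSpeech < Int.MAX_VALUE) {
            framesSinceSpeech++
        }
        return framesSinceSpeech <= hangoverFrames
    }
}

fun main() {
    val silence = ShortArray(480)
    val loud = ShortArray(480) { (if (it % 2 == 0) 8000 else -8000).toShort() }
    val vad = EnergyVad(hangoverFrames = 1)
    println(vad.isSpeech(silence)) // false: nothing heard yet
    println(vad.isSpeech(loud))    // true: above threshold
    println(vad.isSpeech(silence)) // true: still inside the hangover
    println(vad.isSpeech(silence)) // false: hangover expired
}
```

A fixed energy threshold is cheap but sensitive to microphone gain, which is consistent with the plan to move to a model-based VAD in Phase 2.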

Why GCC-PHAT and not an ML model for direction

Direction of arrival (DoA) with two microphones is a solved mathematical problem if you frame it correctly. GCC-PHAT gives a time difference of arrival (TDOA), from which a single azimuth follows. On a phone with ~14 cm between the mics this does not give degree-level precision, but it does give a reliable left / centre / right / rear-left / rear-right indication (five zones). That is enough for the use case, without ML models that cost latency and battery.
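The time-difference step can be sketched as follows. This is a minimal illustration and not the project's implementation: it swaps in a naive O(n²) DFT for readability where a real pipeline would use an FFT, and the function names, the 48 kHz sample rate and the 343 m/s speed of sound are assumptions. With those numbers and 14 cm spacing, the largest physically possible delay is about 0.14 / 343 × 48000 ≈ 19.6 samples, which shows why degree-level precision is out of reach. The mapping from τ to the project's five zones is not described here and is not reproduced.

```kotlin
import kotlin.math.PI
import kotlin.math.asin
import kotlin.math.cos
import kotlin.math.hypot
import kotlin.math.sin
import kotlin.random.Random

/**
 * GCC-PHAT of two equal-length frames for lags -maxLag..maxLag.
 * Index k of the result corresponds to a lag of (k - maxLag) samples.
 * A positive peak lag means the left channel is delayed relative to the right.
 */
fun gccPhat(left: DoubleArray, right: DoubleArray, maxLag: Int): DoubleArray {
    require(left.size == right.size) { "frames must have equal length" }
    val n = left.size
    val re = DoubleArray(n)
    val im = DoubleArray(n)
    for (f in 0 until n) {
        var lr = 0.0; var li = 0.0; var rr = 0.0; var ri = 0.0
        for (t in 0 until n) { // naive DFT of both channels at bin f
            val a = -2.0 * PI * f * t / n
            lr += left[t] * cos(a); li += left[t] * sin(a)
            rr += right[t] * cos(a); ri += right[t] * sin(a)
        }
        val cr = lr * rr + li * ri // Re(L * conj(R))
        val ci = li * rr - lr * ri // Im(L * conj(R))
        val mag = hypot(cr, ci)
        // PHAT weighting: keep only the phase of the cross-power spectrum.
        if (mag > 1e-12) { re[f] = cr / mag; im[f] = ci / mag }
    }
    // Inverse DFT, evaluated only at the lags of interest.
    return DoubleArray(2 * maxLag + 1) { k ->
        val lag = k - maxLag
        var acc = 0.0
        for (f in 0 until n) {
            val a = 2.0 * PI * f * lag / n
            acc += re[f] * cos(a) - im[f] * sin(a)
        }
        acc / n
    }
}

/** Far-field model: angle from broadside of the mic pair, in degrees. */
fun lagToAngleDeg(
    lagSamples: Int,
    sampleRateHz: Int,
    micSpacingM: Double,
    speedOfSound: Double = 343.0, // m/s, assumed
): Double {
    val tau = lagSamples.toDouble() / sampleRateHz
    val s = (speedOfSound * tau / micSpacingM).coerceIn(-1.0, 1.0)
    return asin(s) * 180.0 / PI
}

fun main() {
    val n = 64
    val rng = Random(42)
    val left = DoubleArray(n) { rng.nextDouble(-1.0, 1.0) }
    val delay = 3 // right channel hears the same signal 3 samples later
    val right = DoubleArray(n) { left[(it - delay + n) % n] }
    val maxLag = 20 // covers the ~19.6-sample physical maximum
    val gcc = gccPhat(left, right, maxLag)
    val peakLag = gcc.indices.maxByOrNull { gcc[it] }!! - maxLag
    println("peak lag = $peakLag samples") // -3: right lags left
    println("angle = ${lagToAngleDeg(peakLag, 48_000, 0.14)} degrees")
}
```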

Confidence is evaluated from the sharpness of the cross-correlation peak: a sharp, high peak means one clear source, while a flat peak means multiple speakers or noise. In the flat case the arc is faded out instead of showing a misleading direction.
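One way to turn peak sharpness into a confidence level is a peak-to-sidelobe ratio, sketched below. The document does not say which sharpness metric the project uses, so the metric itself, the names `PeakConfidence`, `peakToSidelobeRatio` and `classify`, the guard window and the thresholds 3.0 and 1.8 are all assumptions for illustration.

```kotlin
import kotlin.math.abs

enum class PeakConfidence { HIGH, MEDIUM, LOW } // hypothetical levels

/** Ratio of the main peak to the largest magnitude outside a guard window around it. */
fun peakToSidelobeRatio(gcc: DoubleArray, guard: Int = 2): Double {
    require(gcc.isNotEmpty()) { "correlation must not be empty" }
    val peakIdx = gcc.indices.maxByOrNull { gcc[it] }!!
    val peak = gcc[peakIdx]
    if (peak <= 0.0) return 0.0 // no usable peak at all
    var sidelobe = 0.0
    for (i in gcc.indices) {
        if (abs(i - peakIdx) > guard) sidelobe = maxOf(sidelobe, abs(gcc[i]))
    }
    return if (sidelobe == 0.0) Double.POSITIVE_INFINITY else peak / sidelobe
}

/** Maps the ratio to a level; thresholds are placeholders that need tuning on real recordings. */
fun classify(ratio: Double, high: Double = 3.0, medium: Double = 1.8): PeakConfidence = when {
    ratio >= high -> PeakConfidence.HIGH
    ratio >= medium -> PeakConfidence.MEDIUM
    else -> PeakConfidence.LOW
}

fun main() {
    val sharp = doubleArrayOf(0.02, 0.05, 0.03, 0.9, 0.04, 0.01, 0.03)
    val flat = doubleArrayOf(0.20, 0.25, 0.22, 0.30, 0.28, 0.24, 0.21)
    println(classify(peakToSidelobeRatio(sharp, guard = 1))) // HIGH
    println(classify(peakToSidelobeRatio(flat, guard = 1)))  // LOW
}
```

The guard window keeps samples that belong to the main lobe from being counted as competing peaks.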

Status + roadmap

Now: MVP scaffold ready. Domain layer unit-tested, Compose UI runs, stereo capture works, captions work via SpeechRecognizer. Mock mode for emulator development.

Phase 2: Vosk offline ASR + Silero-VAD ONNX (better VAD, device-independent).

Phase 4: Stereo BLE input — connect external stereo mics over Bluetooth for better DoA precision than the built-in phone mics.