Inspiration
Hearing aids amplify everything indiscriminately. In noisy environments like restaurants or crowded rooms, users hear background noise at the same volume as the person they're trying to talk to. Directional microphones exist as a partial workaround, but they only help when sound comes from a specific direction, and they introduce their own distortion and noise. We wanted to explore a different approach: rather than filtering by direction, filter by speaker identity. If a hearing aid could learn someone's voice and isolate only that voice in real time, it could work regardless of where the noise is coming from.
What It Does
Prism is a real-time voice isolation system that learns the voice of a specific person and filters out everything else. It uses both face and voice information to build a speaker profile, then applies that profile to live audio to let only that speaker through despite background noise or interruptions.
- To start, you look at the person you want to talk to and trigger a scan. We capture a 5-second video clip and use Dolphin to extract their voice from it.
- After that, any time the enrolled person talks, their voice is streamed through while extraneous noise is suppressed, so you can hear the person you need to clearly. You can enroll multiple people at once, have two enrolled speakers talk simultaneously, and add more people mid-conversation; the system adapts in real time as the conversation progresses.
How We Built It
Prism is a multi-layer real-time audio pipeline that spans cloud GPUs and local audio I/O.
Speaker Enrollment via Audio-Visual Targeting
When you scan someone, main_app2.py uses MediaPipe BlazeFace to detect faces in frame and compute their azimuth angle relative to the camera center. Pressing 'i' captures a synchronized 5-second MP4 — 25 FPS video and 16 kHz mono audio mixed together.
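The azimuth computation can be sketched with a simple pinhole-camera model. This is an illustrative version, not the code in main_app2.py; the 60° horizontal field of view is an assumed webcam value, and `cx_norm` stands in for MediaPipe's normalized bounding-box center:

```python
import math

def face_azimuth_deg(cx_norm: float, hfov_deg: float = 60.0) -> float:
    """Approximate horizontal angle of a face relative to camera center.

    cx_norm: face bounding-box center x in [0, 1] (MediaPipe's normalized coords).
    hfov_deg: assumed horizontal field of view of the webcam.
    """
    half_fov = math.radians(hfov_deg / 2)
    # Map [0, 1] -> [-1, 1] around the image center, then project
    # through the pinhole model: angle = atan(offset * tan(fov/2)).
    offset = 2.0 * cx_norm - 1.0
    return math.degrees(math.atan(offset * math.tan(half_fov)))
```

A face dead-center in frame yields 0°, and a face at the right edge yields +30° under the assumed FOV.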
That clip is sent to a Modal A10G GPU running Dolphin, a multimodal audio-visual speech extraction model from JusperLee/Dolphin. Dolphin uses lip movement data alongside audio to isolate the target speaker's voice from the clip. The resulting audio is passed to ECAPA-TDNN (speechbrain/spkrec-ecapa-voxceleb), which produces a 192-dimensional speaker embedding — a numerical representation of that person's voice characteristics. This embedding is saved locally alongside a reference audio clip and reloaded into the live audio stream without requiring a reconnection.
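Downstream, the live gate compares the enrolled embedding against embeddings of incoming audio; cosine similarity is the standard metric for ECAPA-TDNN vectors, and the similarity scores mentioned later assume this kind of measure. A minimal, dependency-free sketch:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two speaker embeddings (e.g. the
    192-dimensional ECAPA-TDNN vectors). 1.0 means identical direction;
    values near 0 mean the voices are unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)
```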
Real-Time Target Speaker Extraction
Live microphone audio is streamed in 150ms chunks (2,400 samples at 16 kHz) over a WebSocket to a Modal H100 GPU backend. The backend runs TSE (Target Speaker Extraction) on every chunk, using the enrolled speaker embedding and reference audio clip to directly extract only the target speaker's voice from the incoming audio. Unlike approaches that separate audio into a fixed number of tracks, TSE conditions on the reference audio itself — so it filters out everything that doesn't match the target speaker's voice characteristics, whether that's background noise, music, or multiple people talking simultaneously.
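The chunking arithmetic works out as follows; this is a sketch of the client-side framing under the stated parameters (150 ms, 16 kHz, int16 PCM), not the actual streaming code:

```python
SR = 16_000                              # sample rate (Hz)
CHUNK_MS = 150                           # chunk duration (ms)
CHUNK_SAMPLES = SR * CHUNK_MS // 1000    # 2,400 samples per chunk
CHUNK_BYTES = CHUNK_SAMPLES * 2          # int16 -> 4,800 bytes per WebSocket message

def iter_chunks(pcm: bytes):
    """Yield fixed-size 150 ms frames from a raw int16 PCM byte stream,
    holding back any trailing partial chunk for the next call."""
    for i in range(0, len(pcm) - CHUNK_BYTES + 1, CHUNK_BYTES):
        yield pcm[i:i + CHUNK_BYTES]
```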
A hysteresis gate sits on top of the TSE output: if similarity to the enrolled profile crosses 0.35, the gate opens; if it drops but stays above 0.25, the gate holds open for 0.8 seconds to account for natural pauses in speech; if it falls below that threshold for longer, the gate closes. The response sent back to the client is a 16-byte binary header (four float32 values: sim1, sim2, gain1, gain2) followed by processed int16 audio. The client plays back through a 2-chunk jitter buffer to smooth network variance.
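The gate logic above can be sketched as a small state machine, with the response header packed via `struct`. The constants mirror the thresholds described; the class and function names are ours for illustration, not the server's actual identifiers:

```python
import struct

OPEN_THRESH = 0.35   # similarity needed to open the gate
HOLD_THRESH = 0.25   # gate may stay open above this...
HOLD_SECS = 0.8      # ...for at most this long (bridges natural speech pauses)

class HysteresisGate:
    def __init__(self):
        self.open = False
        self.last_strong = 0.0   # timestamp of last similarity >= OPEN_THRESH

    def update(self, sim: float, now: float) -> bool:
        if sim >= OPEN_THRESH:
            self.open = True
            self.last_strong = now
        elif self.open:
            # Close if similarity collapses, or if the hold window expires.
            if sim < HOLD_THRESH or (now - self.last_strong) > HOLD_SECS:
                self.open = False
        return self.open

def pack_response(sim1, sim2, gain1, gain2, pcm: bytes) -> bytes:
    """16-byte header (four little-endian float32s) followed by int16 PCM."""
    return struct.pack("<4f", sim1, sim2, gain1, gain2) + pcm
```

The two thresholds are what make it hysteresis rather than a plain gate: a brief dip in similarity during a pause doesn't chop the speaker's audio mid-sentence.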
Fallback Paths
For local development without cloud access, noise_gate/target_speaker_vad.py runs a CPU-only STFT spectral subtraction pipeline with local ECAPA-TDNN at around 10ms latency. A FastAPI local embedding server (speaker_embedding/local_speaker_embedding.py) provides a drop-in replacement for the cloud enrollment endpoint.
Challenges We Ran Into
Latency management across the WebSocket. Network jitter made naive playback stutter constantly. We had to implement a jitter buffer with dynamic depth management — buffering 2+ chunks before playback starts, and actively dropping stale chunks if the buffer grew too deep. Getting the balance right between latency and smoothness took significant tuning.
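The buffering strategy can be sketched roughly as below, assuming a priming depth of 2 chunks and a hypothetical maximum backlog; the real implementation tunes these values:

```python
from collections import deque

class JitterBuffer:
    """Smooths network jitter: hold playback until `depth` chunks have
    arrived, and drop the oldest chunks if the backlog exceeds `max_depth`
    so latency can't grow without bound."""

    def __init__(self, depth: int = 2, max_depth: int = 6):
        self.depth, self.max_depth = depth, max_depth
        self.q = deque()
        self.primed = False

    def push(self, chunk: bytes):
        self.q.append(chunk)
        while len(self.q) > self.max_depth:
            self.q.popleft()  # stale audio: prefer freshness over completeness

    def pop(self):
        """Called by the audio output callback; None means play silence."""
        if not self.primed:
            if len(self.q) < self.depth:
                return None
            self.primed = True
        if not self.q:
            self.primed = False  # underrun: re-prime before resuming
            return None
        return self.q.popleft()
```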
Dolphin's dependency conflicts. Dolphin requires TensorFlow for its face detection backbone (RetinaFace) but also uses PyTorch for the core separation model. Getting both to coexist on the same A10G without CUDA library conflicts required forcing TensorFlow to CPU and pinning to legacy Keras 2 mode.
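The fix amounts to a couple of lines of process setup. This is a sketch of the approach rather than our exact container config; `TF_USE_LEGACY_KERAS` is the standard opt-in for the Keras 2 shim on recent TensorFlow builds:

```python
import os

# Must run BEFORE `import tensorflow`: opt in to the legacy Keras 2 API
# (the tf-keras shim) that Dolphin's RetinaFace backbone expects.
os.environ["TF_USE_LEGACY_KERAS"] = "1"

import tensorflow as tf

# Hide all GPUs from TensorFlow only. PyTorch still sees CUDA, so the
# core separation model keeps the A10G while RetinaFace runs on CPU,
# and the two frameworks stop fighting over CUDA libraries.
tf.config.set_visible_devices([], "GPU")
```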
SepFormer's hard 2-speaker limit. SepFormer outputs exactly 2 source tracks regardless of how many people are in the room. We ended up replacing SepFormer with TSE, which conditions on the target speaker instead of splitting the mixture into a fixed number of tracks.
Real-time video + audio synchronization. Recording a synchronized MP4 for Dolphin requires careful alignment of the video frame buffer and audio sample buffer. Off-by-one issues between the 25 FPS video clock and the 16 kHz audio clock caused subtle sync drift that degraded Dolphin's lip-audio alignment and reduced extraction quality.
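The drift we were chasing comes down to two clocks that must agree: at 25 FPS and 16 kHz, each video frame corresponds to exactly 640 audio samples. A small sanity check along these lines (illustrative, not our recorder code) makes the mismatch measurable:

```python
SR = 16_000                     # audio sample rate (Hz)
FPS = 25                        # video frame rate
SAMPLES_PER_FRAME = SR // FPS   # 640 audio samples per video frame

def av_drift_ms(n_frames: int, n_samples: int) -> float:
    """Signed audio-video drift in milliseconds for a recorded clip.

    Positive means the audio buffer runs ahead of the video buffer.
    Dolphin's lip-audio alignment degrades once drift approaches one
    frame (40 ms), so the recorder should keep this near zero.
    """
    video_s = n_frames / FPS
    audio_s = n_samples / SR
    return (audio_s - video_s) * 1000.0
```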
Accomplishments That We're Proud Of
- Sub-500ms end-to-end latency from microphone input to processed audio output over a real network, made possible by Modal.
- Hot-reload embeddings mid-stream — you can enroll a new speaker or update an existing fingerprint without any interruption to the live audio session.
- The Dolphin + ECAPA pipeline fully works end-to-end — a 5-second face scan produces a voice embedding that the live gate actually recognizes reliably.
- Graceful degradation across three execution modes — cloud GPU, local CPU VAD, and local embedding server all work independently, so the system never fully fails even without cloud connectivity.
What We Learned
- Source separation and speaker verification are coupled problems: you can't treat them independently without introducing domain mismatch.
- Multimodal approaches (audio + video) dramatically improve speaker isolation over audio-only methods, especially in overlapping speech scenarios.
- Real-time audio systems need layered buffering strategies. A single jitter buffer isn't enough; you have to account for callback alignment, network variance, and processing time independently.
- Modal's serverless GPU infrastructure makes it practical to run H100-class models at sub-second latency from a consumer laptop.
- The most significant UX win wasn't audio quality; it was eliminating the need for users to restate who they're talking to. Biometric locking just works in the background.
What's Next for Prism
- On-device inference — port the ECAPA-TDNN embedding to CoreML or ONNX so enrollment and gating can run locally on-device, enabling a true hearing aid form factor with no cloud dependency.
- Adaptive enrollment — continuously update the speaker embedding as more voice samples accumulate in the session, improving accuracy over time rather than relying on a single 5-second clip.
- Improved hardware integration — package Prism into a wearable form factor with a microphone array and bone-conduction speaker, so the entire experience runs without a laptop or phone in hand, instead of just the janky webcam headband.
Built With
- dolphin
- ecapa-tdnn
- mediapipe
- modal
- python
- tse
- websockets