← Projects

ChorusReader

Pre-launch

Every character gets a voice

Takes your books and gives every character a distinct voice. 30 built-in voices, 12 emotion categories, real-time synthesis on iPhone. No per-character cloud fees, no data leakage. Powered by ChorusTTS — a custom accelerated pipeline built on Chatterbox weights.

voices
30
emotions
12
speed
20% faster than real-time on iPhone
Swift MLX ChorusTTS AVFoundation
// WHY MULTI-VOICE

Audiobook narration is a solved problem if you're fine with one voice reading everything. But characters have voices. Dialogue has emotion. ChorusReader assigns distinct voices to characters and modulates emotion per scene — happy, sad, angry, contemplative. All running on your iPhone, no cloud API calls, no per-character fees.

// CAPABILITIES
  • 30 built-in voices with distinct timbres
  • 12 emotion categories for expressive narration
  • ePub and PDF support with chapter detection
  • Real-time synthesis — runs 20% faster than real-time on iPhone
  • Powered by ChorusTTS — a custom accelerated pipeline built on Chatterbox weights (thanks Resemble AI)

// DEMO

Multi-voice narration — each character gets a distinct voice and emotion.

// THE TTS PIPELINE

ChorusTTS is a custom Swift pipeline built on Chatterbox weights by Resemble AI. Three stages turn text into speech — each optimized for Apple Silicon, with weights quantized to fit on a phone.

End-to-end pipeline from book analysis to streaming playback. Thanks to Resemble AI for open-sourcing the Chatterbox weights.
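The overall shape of such a pipeline can be sketched as three hand-offs — analysis, synthesis, streaming playback. This is a minimal sketch only: the stage names, types, and `AsyncStream` plumbing below are assumptions for illustration, not the actual ChorusTTS API.

```swift
import Foundation

// Illustrative types only — not ChorusTTS's real interfaces.
struct Utterance {
    let text: String
    let voice: String    // which character speaks this line
    let emotion: String  // one of the emotion categories
}

// Stage 1 — analysis: split the chapter into attributed utterances.
// (Stubbed: everything goes to a neutral narrator here.)
func analyze(_ chapter: String) -> [Utterance] {
    chapter.split(separator: "\n").map {
        Utterance(text: String($0), voice: "narrator", emotion: "neutral")
    }
}

// Stage 2 — synthesis: turn one utterance into PCM samples (stubbed).
func synthesize(_ utterance: Utterance) -> [Float] {
    []
}

// Stage 3 — playback: yield audio buffers as they are produced, so
// playback can begin before the whole chapter finishes rendering.
func stream(_ chapter: String) -> AsyncStream<[Float]> {
    AsyncStream { continuation in
        for utterance in analyze(chapter) {
            continuation.yield(synthesize(utterance))
        }
        continuation.finish()
    }
}
```

The key property this shape gives you is that each stage only needs the previous stage's output for one utterance at a time, which keeps peak memory flat on a phone.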

// KEY DECISIONS
  • Hybrid book analysis — regex-based quote detection feeds into an on-device LLM (Foundation Models) for speaker attribution and emotion tagging
  • Each voice is a set of speaker embeddings, not a separate model — switching voices is instant and adds no memory overhead
  • 12 emotion profiles map to concrete TTS parameters (temperature, guidance weight, pacing) — the same text sounds different when a character whispers vs. shouts
  • Streaming synthesis with checkpoint resume — if the app is interrupted mid-generation, it picks up exactly where it left off
  • Voice cloning from a few seconds of reference audio via an LSTM encoder — bring your own narrator
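The regex pass in the hybrid analysis might look something like this — a sketch that pulls quoted dialogue spans out of a paragraph so a downstream LLM can attribute speakers. The pattern and function name are assumptions, not ChorusReader's actual detector; it matches both straight and curly quotes.

```swift
import Foundation

// Illustrative quote detector: returns the ranges of quoted dialogue in a
// paragraph. The pattern is a simplification — real prose has nested and
// multi-paragraph quotes that need more care.
func dialogueSpans(in paragraph: String) -> [Range<String.Index>] {
    let pattern = #""[^"]+"|“[^”]+”"#  // straight or curly quote pairs
    guard let regex = try? NSRegularExpression(pattern: pattern) else {
        return []
    }
    let fullRange = NSRange(paragraph.startIndex..., in: paragraph)
    return regex.matches(in: paragraph, range: fullRange)
        .compactMap { Range($0.range, in: paragraph) }
}
```

Each detected span, plus its surrounding sentence, is what the on-device LLM would see when deciding who is speaking and how.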
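The embeddings-not-models and emotion-profile decisions can be sketched together: a voice is just a small vector handed to one shared model, and each emotion category is a bundle of concrete sampling parameters. Field names and numeric values below are illustrative assumptions, not ChorusReader's actual tables.

```swift
// Illustrative only: a voice as a reusable embedding. Swapping voices means
// swapping this vector — the synthesis model itself is shared, so a new
// voice costs a few KB, not a second model in memory.
struct Voice {
    let name: String
    let embedding: [Float]  // e.g. a small speaker vector (dimension assumed)
}

// Illustrative emotion profile: each category maps to concrete parameters
// that change how the same text is rendered.
struct EmotionProfile {
    let temperature: Float    // sampling randomness
    let guidanceWeight: Float // conditioning strength
    let pacing: Float         // relative speaking rate (1.0 = neutral)
}

// Two of the 12 categories, to show the whisper-vs-shout contrast.
let profiles: [String: EmotionProfile] = [
    "whisper": EmotionProfile(temperature: 0.5, guidanceWeight: 2.5, pacing: 0.85),
    "angry":   EmotionProfile(temperature: 0.9, guidanceWeight: 4.0, pacing: 1.15),
]
```

Because both pieces are plain data, a (voice, emotion) pair for a line of dialogue is just a lookup, which is what makes per-utterance switching cheap.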
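Checkpoint resume for streaming synthesis could be as simple as persisting the index of the last fully synthesized sentence. The sketch below assumes sentence-level granularity and a `UserDefaults` store — both are illustrative choices, not ChorusReader's actual persistence layer.

```swift
import Foundation

// Hypothetical checkpoint record, persisted after each completed sentence.
struct SynthesisCheckpoint: Codable {
    let chapterID: String
    let sentenceIndex: Int  // last fully synthesized sentence
}

func save(_ checkpoint: SynthesisCheckpoint,
          to defaults: UserDefaults = .standard) {
    if let data = try? JSONEncoder().encode(checkpoint) {
        defaults.set(data, forKey: "checkpoint.\(checkpoint.chapterID)")
    }
}

// On relaunch, resume from the saved index — or the chapter start if no
// checkpoint exists or it fails to decode.
func resumeIndex(for chapterID: String,
                 in defaults: UserDefaults = .standard) -> Int {
    guard let data = defaults.data(forKey: "checkpoint.\(chapterID)"),
          let cp = try? JSONDecoder().decode(SynthesisCheckpoint.self,
                                             from: data)
    else { return 0 }
    return cp.sentenceIndex
}
```

Writing the checkpoint only after a sentence's audio is fully flushed is what makes the resume exact: an interruption mid-sentence re-synthesizes that sentence rather than emitting a torn buffer.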