← Projects

ChorusReader

Pre-launch

Every character gets a voice

Takes your books and gives every character a distinct voice. 30 built-in voices, 12 emotion categories, real-time synthesis on iPhone. No per-character cloud fees, no data leakage. Powered by ChorusTTS — a custom accelerated pipeline built on Chatterbox weights.

voices
30
emotions
12
speed
20% faster than real-time on iPhone
Swift MLX ChorusTTS AVFoundation
// WHY MULTI-VOICE

Audiobook narration is a solved problem if you're fine with one voice reading everything. But characters have voices. Dialogue has emotion. ChorusReader assigns distinct voices to characters and modulates emotion per scene — happy, sad, angry, contemplative. All running on your iPhone, no cloud API calls, no per-character fees.

// CAPABILITIES
  • 30 built-in voices with distinct timbres
  • 12 emotion categories for expressive narration
  • ePub and PDF support with chapter detection
  • Real-time synthesis — runs 20% faster than real-time on iPhone
  • Powered by ChorusTTS — a custom accelerated pipeline built on Chatterbox weights (thanks Resemble AI)

// DEMO

Multi-voice narration — each character gets a distinct voice and emotion.

// THE TTS PIPELINE

ChorusTTS is a custom Swift pipeline built on Chatterbox weights by Resemble AI. Three stages turn text into speech — each optimized for Apple Silicon, with weights quantized to fit on a phone.

End-to-end pipeline from book analysis to streaming playback. Thanks to Resemble AI for open-sourcing the Chatterbox weights.
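The overall shape of such a pipeline can be sketched as three hand-offs — analysis, synthesis, streaming playback. This is a minimal sketch only: the stage names, types, and `AsyncStream` plumbing below are assumptions for illustration, not the actual ChorusTTS API.

```swift
import Foundation

// Illustrative types only — not ChorusTTS's real interfaces.
struct Utterance {
    let text: String
    let voice: String    // which character speaks this line
    let emotion: String  // one of the emotion categories
}

// Stage 1 — analysis: split the chapter into attributed utterances.
// (Stubbed: everything goes to a neutral narrator here.)
func analyze(_ chapter: String) -> [Utterance] {
    chapter.split(separator: "\n").map {
        Utterance(text: String($0), voice: "narrator", emotion: "neutral")
    }
}

// Stage 2 — synthesis: turn one utterance into PCM samples (stubbed).
func synthesize(_ utterance: Utterance) -> [Float] {
    []
}

// Stage 3 — playback: yield audio buffers as they are produced, so
// playback can begin before the whole chapter finishes rendering.
func stream(_ chapter: String) -> AsyncStream<[Float]> {
    AsyncStream { continuation in
        for utterance in analyze(chapter) {
            continuation.yield(synthesize(utterance))
        }
        continuation.finish()
    }
}
```

The key property this shape gives you is that each stage only needs the previous stage's output for one utterance at a time, which keeps peak memory flat on a phone.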

// KEY DECISIONS
  • Hybrid book analysis — regex-based quote detection feeds into an on-device LLM (Foundation Models) for speaker attribution and emotion tagging
  • Each voice is a set of speaker embeddings, not a separate model — switching voices is instant and adds no memory overhead
  • 12 emotion profiles map to concrete TTS parameters (temperature, guidance weight, pacing) — the same text sounds different when a character whispers vs. shouts
  • Streaming synthesis with checkpoint resume — if the app is interrupted mid-generation, it picks up exactly where it left off
  • Voice cloning from a few seconds of reference audio via an LSTM encoder — bring your own narrator
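The regex pass in the hybrid analysis might look something like this — a sketch that pulls quoted dialogue spans out of a paragraph so a downstream LLM can attribute speakers. The pattern and function name are assumptions, not ChorusReader's actual detector; it matches both straight and curly quotes.

```swift
import Foundation

// Illustrative quote detector: returns the ranges of quoted dialogue in a
// paragraph. The pattern is a simplification — real prose has nested and
// multi-paragraph quotes that need more care.
func dialogueSpans(in paragraph: String) -> [Range<String.Index>] {
    let pattern = #""[^"]+"|“[^”]+”"#  // straight or curly quote pairs
    guard let regex = try? NSRegularExpression(pattern: pattern) else {
        return []
    }
    let fullRange = NSRange(paragraph.startIndex..., in: paragraph)
    return regex.matches(in: paragraph, range: fullRange)
        .compactMap { Range($0.range, in: paragraph) }
}
```

Each detected span, plus its surrounding sentence, is what the on-device LLM would see when deciding who is speaking and how.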
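The embeddings-not-models and emotion-profile decisions can be sketched together: a voice is just a small vector handed to one shared model, and each emotion category is a bundle of concrete sampling parameters. Field names and numeric values below are illustrative assumptions, not ChorusReader's actual tables.

```swift
// Illustrative only: a voice as a reusable embedding. Swapping voices means
// swapping this vector — the synthesis model itself is shared, so a new
// voice costs a few KB, not a second model in memory.
struct Voice {
    let name: String
    let embedding: [Float]  // e.g. a small speaker vector (dimension assumed)
}

// Illustrative emotion profile: each category maps to concrete parameters
// that change how the same text is rendered.
struct EmotionProfile {
    let temperature: Float    // sampling randomness
    let guidanceWeight: Float // conditioning strength
    let pacing: Float         // relative speaking rate (1.0 = neutral)
}

// Two of the 12 categories, to show the whisper-vs-shout contrast.
let profiles: [String: EmotionProfile] = [
    "whisper": EmotionProfile(temperature: 0.5, guidanceWeight: 2.5, pacing: 0.85),
    "angry":   EmotionProfile(temperature: 0.9, guidanceWeight: 4.0, pacing: 1.15),
]
```

Because both pieces are plain data, a (voice, emotion) pair for a line of dialogue is just a lookup, which is what makes per-utterance switching cheap.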
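Checkpoint resume for streaming synthesis could be as simple as persisting the index of the last fully synthesized sentence. The sketch below assumes sentence-level granularity and a `UserDefaults` store — both are illustrative choices, not ChorusReader's actual persistence layer.

```swift
import Foundation

// Hypothetical checkpoint record, persisted after each completed sentence.
struct SynthesisCheckpoint: Codable {
    let chapterID: String
    let sentenceIndex: Int  // last fully synthesized sentence
}

func save(_ checkpoint: SynthesisCheckpoint,
          to defaults: UserDefaults = .standard) {
    if let data = try? JSONEncoder().encode(checkpoint) {
        defaults.set(data, forKey: "checkpoint.\(checkpoint.chapterID)")
    }
}

// On relaunch, resume from the saved index — or the chapter start if no
// checkpoint exists or it fails to decode.
func resumeIndex(for chapterID: String,
                 in defaults: UserDefaults = .standard) -> Int {
    guard let data = defaults.data(forKey: "checkpoint.\(chapterID)"),
          let cp = try? JSONDecoder().decode(SynthesisCheckpoint.self,
                                             from: data)
    else { return 0 }
    return cp.sentenceIndex
}
```

Writing the checkpoint only after a sentence's audio is fully flushed is what makes the resume exact: an interruption mid-sentence re-synthesizes that sentence rather than emitting a torn buffer.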