
Comprehensive benchmark of 18 on-device text-to-speech models including Kokoro, Piper, Matcha, Kitten, and VITS on Android and iOS
Source Code:
- android-offline-tts-eval — Android TTS benchmark app with 7 models via sherpa-onnx and system TTS
- ios-offline-tts-eval — iOS TTS benchmark app with 11 models including AVSpeech
Abstract
We benchmark 18 on-device text-to-speech models across 2 inference engines on Android (7 models) and iOS (11 models), measuring synthesis speed (tok/s), real-time factor (RTF), and memory usage. All benchmarks use English text prompts only. Results show Android System TTS and Piper VITS achieve the fastest synthesis (33–42 tok/s on Android), while Kokoro runs slower than real-time on both platforms. On iOS, Apple’s built-in AVSpeechSynthesizer scores highest overall due to minimal memory overhead, but Matcha + Vocos provides the best balance of speed and resource efficiency among open-source models. No formal listening test (MOS/ABX) was conducted — this benchmark measures speed and resource usage, not perceptual voice quality.
Motivation
Developers integrating TTS into mobile and edge applications face a three-way trade-off: synthesis latency (can the model keep up with real-time interaction?), memory footprint (can it coexist with ASR and other AI models on a memory-constrained device?), and voice quality (is the output acceptable for the use case?). In our measurements, median synthesis time spans 440 ms (Android System TTS) to 15.3 s (Kokoro Int8), a roughly 35x difference, with memory footprints ranging from 21 MB to 833 MB.
Existing TTS comparisons typically evaluate quality (MOS scores) on server hardware, but for on-device deployment the binding constraints are speed and memory, not quality alone. A model that sounds excellent but takes 15 seconds to synthesize a sentence is unusable for interactive applications. This benchmark provides the speed and memory data developers need to make the UX trade-off decision: which model delivers acceptable latency on real mobile hardware?
Methodology
Both platforms use a standardized set of 12 English text prompts of varying length and complexity. All results in this benchmark reflect English synthesis performance only — multilingual models were not evaluated in other languages. Each model is evaluated in warm mode (1 warm-up iteration) to measure steady-state performance.
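The warm-mode protocol above can be sketched as a small measurement harness. `synthesize` here is a hypothetical stand-in for whichever engine call is under test (sherpa-onnx, system TTS); it is assumed to return the duration of the generated audio in seconds:

```python
import statistics
import time

def benchmark_model(synthesize, prompts, warmup_iters=1):
    """Measure per-prompt synthesis time, tok/s, and RTF in warm mode.

    `synthesize(text)` is a placeholder for the real engine call; it
    must return the duration of the generated audio in seconds.
    """
    # Warm-up: run once so model-loading and first-call costs are excluded.
    for _ in range(warmup_iters):
        synthesize(prompts[0])

    synth_ms, tok_s, rtf = [], [], []
    for text in prompts:
        start = time.perf_counter()
        audio_seconds = synthesize(text)
        elapsed = time.perf_counter() - start
        words = len(text.split())            # tokens = whitespace-split words
        synth_ms.append(elapsed * 1000.0)
        tok_s.append(words / elapsed)
        rtf.append(elapsed / audio_seconds)  # < 1.0 means faster than real-time
    return {
        "median_synth_ms": statistics.median(synth_ms),
        "median_tok_s": statistics.median(tok_s),
        "median_rtf": statistics.median(rtf),
    }
```

Reporting medians rather than means keeps one slow outlier prompt from skewing a model's headline number.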
Metrics:
- tok/s: Output tokens (words) synthesized per second (higher = faster)
- RTF: Real-Time Factor — ratio of synthesis time to audio duration (below 1.0 = faster than real-time)
- Overall Score: Composite metric (iOS only, 0–100 scale) = weighted combination of Speed Score (tok/s normalized + RTF penalty) and Memory Score (inverse of memory usage). Models with RTF > 1.5 receive a Speed Score of 0. Full formula in ios-offline-tts-eval source. This score does not include voice quality — no formal listening test (MOS/ABX) was conducted.
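The metrics above can be expressed directly in code. The exact iOS Overall Score formula lives in the ios-offline-tts-eval source; the sketch below only mirrors its documented structure (speed plus memory components, RTF > 1.5 zeroing the Speed Score), and the normalization constants and weights are assumptions, not the app's actual values:

```python
def rtf(synth_seconds: float, audio_seconds: float) -> float:
    """Real-Time Factor: synthesis time divided by audio duration."""
    return synth_seconds / audio_seconds

def tokens_per_second(word_count: int, synth_seconds: float) -> float:
    """Output words synthesized per second of wall-clock time."""
    return word_count / synth_seconds

def overall_score(tok_s, rtf_value, memory_mb,
                  max_tok_s=160.0, max_memory_mb=900.0,
                  speed_weight=0.6, memory_weight=0.4):
    """Illustrative composite in the spirit of the iOS Overall Score.

    max_tok_s, max_memory_mb, and the weights are placeholder
    assumptions; only the RTF > 1.5 cutoff is taken from the text.
    """
    if rtf_value > 1.5:
        speed = 0.0  # documented cutoff: too slow for interactive use
    else:
        speed = min(tok_s / max_tok_s, 1.0) * 100.0
    memory = max(0.0, 1.0 - memory_mb / max_memory_mb) * 100.0
    return speed_weight * speed + memory_weight * memory
```

With this shape, a model like VITS LJS (RTF 2.02) keeps a nonzero Overall Score purely through its memory component, matching the zero Speed Scores in the iOS table.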
Devices:
| Device | Chip | RAM | OS |
|---|---|---|---|
| Samsung Galaxy S10 | Exynos 9820 | 8 GB | Android 12 (API 31) |
| iPad Pro 3rd gen | A12X Bionic | 4 GB | iPadOS 17+ |
Android Results
Device: Samsung Galaxy S10, Android 12, API 31, 4 threads
| Model | Engine | Median Synth (ms) | Median tok/s | Median RTF | Status |
|---|---|---|---|---|---|
| Android System TTS | android_system_tts | 440 | 42.48 | 0.058 | PASS |
| Piper (ryan-low) | sherpa-onnx | 478 | 39.14 | 0.077 | PASS |
| Piper (amy-low) | sherpa-onnx | 524 | 33.39 | 0.076 | PASS |
| Matcha-Icefall (LJSpeech + HiFiGAN) | sherpa-onnx | 1,104 | 16.37 | 0.135 | PASS |
| Kitten Nano (en v0.2 fp16) | sherpa-onnx | 3,526 | 5.18 | 0.387 | PASS |
| Kokoro (en v0.19) | sherpa-onnx | 8,226 | 2.37 | 1.133 | PASS |
| Kokoro Int8 (multi-lang v1.1) | sherpa-onnx | 15,343 | 1.25 | 2.423 | PASS |
All 7 models PASS with no crashes or OOM conditions on the 8 GB device.
Android Speed Observations
Android System TTS and Piper VITS models are the fastest (33–42 tok/s). Kokoro models run slower than real-time (RTF > 1.0) but are designed for higher voice quality. Matcha-Icefall offers a middle ground at 16 tok/s. Note: this benchmark measures speed and resource usage only — no formal listening test (MOS) was conducted, so quality comparisons are based on the models’ published characteristics.
iOS Results
Device: iPad Pro 3rd gen, A12X Bionic, 4 GB RAM
| Model | Engine | Overall Score | Speed Score | Median tok/s | Median RTF | Memory (MB) |
|---|---|---|---|---|---|---|
| AVSpeech (System) | native | 100.00 | — | 151.34 | — | 21 |
| Matcha (LJSpeech) + Vocos | sherpa-onnx | 87.77 | 94.39 | 25.68 | 0.084 | 211 |
| Kitten Nano EN (v0.2 fp16) | sherpa-onnx | 59.72 | 75.45 | 5.14 | 0.368 | 193 |
| Kitten Nano (en v0.1 fp16) | sherpa-onnx | 58.90 | 72.86 | 5.61 | 0.407 | 108 |
| Kokoro EN (v0.19) | sherpa-onnx | 43.59 | 58.60 | 4.01 | 0.621 | 833 |
| Kitten Mini EN (v0.1 fp16) | sherpa-onnx | 24.57 | 24.30 | 1.63 | 1.135 | 427 |
| VITS LJS (Int8) | sherpa-onnx | 21.41 | 0.00 | 1.20 | 2.023 | 140 |
| VITS VCTK (Int8) | sherpa-onnx | 20.98 | 0.00 | 1.43 | 2.062 | 122 |
| VITS Melo (ZH+EN, Int8) | sherpa-onnx | 20.07 | 0.00 | 0.83 | 2.874 | 211 |
| Kokoro Int8 (Multi-lang v1.0) | sherpa-onnx | 17.06 | 0.00 | 1.40 | 1.822 | 515 |
| Kokoro Multi-lang INT8 (v1.1) | sherpa-onnx | 16.91 | 0.00 | 1.71 | 1.569 | 588 |
iOS Observations
- AVSpeech scores highest overall due to negligible memory usage (21 MB vs 108–833 MB for the open-source models) and fast synthesis, though voice selection is limited to Apple’s built-in voices.
- Matcha + Vocos is the best open-source option on iOS — fast (RTF 0.08), high overall score (87.8), with moderate memory at 211 MB.
- Kitten Nano models offer a good balance — RTF below 0.5 with reasonable memory (108–193 MB).
- Kokoro EN (v0.19) scores 43.6 overall with RTF 0.62 — faster than real-time but memory-heavy at 833 MB, the largest footprint in this benchmark.
- VITS and Kokoro Int8 variants all run slower than real-time on iPad (RTF > 1.0), making them impractical for interactive use.
Limitations
- No voice quality evaluation: This benchmark does not include perceptual quality metrics (MOS, ABX, or listening tests). Quality comparisons reference the models’ published characteristics only.
- English only: All prompts are in English. Multilingual models (Kokoro multi-lang, VITS Melo ZH+EN) were not evaluated in their other supported languages.
- Single device per platform: Results are from one Android and one iOS device. Performance may vary on other chipsets.
- Overall Score excludes quality: The iOS composite score reflects speed and memory efficiency only — a high-scoring model is not necessarily the best-sounding one.
Further Research
- Perceptual quality study: Run MOS/ABX listening tests with human raters to validate quality claims beyond speed/memory metrics.
- Multilingual prompt suite: Evaluate non-English synthesis quality and speed for multilingual models, not only English prompts.
- Prosody and style control: Benchmark controllability (emotion, speaking rate, punctuation sensitivity) for dialogue and assistant use cases.
- Streaming TTS latency: Measure time-to-first-audio and chunk-level latency for interactive assistants.
- Compression and mobile scaling: Compare INT8/INT4 and voice-cloning variants under strict mobile RAM budgets.
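For the streaming-latency item, time-to-first-audio could be captured with a chunk callback. `stream_synthesize` and `on_chunk` are hypothetical names here, since none of the engines benchmarked above are exercised through a streaming API in this document:

```python
import time

def measure_ttfa(stream_synthesize, text):
    """Time-to-first-audio: delay until the first audio chunk arrives.

    `stream_synthesize(text, on_chunk)` is a hypothetical streaming API
    that invokes `on_chunk(samples)` as audio becomes available.
    Returns the delay in seconds, or None if no chunk was produced.
    """
    start = time.perf_counter()
    first_chunk_delay = []

    def on_chunk(samples):
        # Record only the first chunk's arrival time.
        if not first_chunk_delay:
            first_chunk_delay.append(time.perf_counter() - start)

    stream_synthesize(text, on_chunk)
    return first_chunk_delay[0] if first_chunk_delay else None
```

For interactive assistants, this number matters more than total synthesis time: a model with a high RTF can still feel responsive if its first chunk arrives quickly.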
Conclusion
On-device TTS is viable on both Android and iOS, but model choice depends heavily on the use case. For real-time interactive applications, Android System TTS / Piper (Android) or Matcha + Vocos (iOS) provide fast synthesis (RTF well below 1.0). For pre-generated audio or non-interactive use where latency is acceptable, Kokoro offers richer voice output at the cost of higher synthesis time and memory. System TTS engines remain competitive — Android’s built-in TTS is the fastest option (42 tok/s), while Apple’s AVSpeechSynthesizer scores highest on iOS due to its minimal resource footprint (21 MB).
On iOS, memory consumption varies widely — from 108 MB (Kitten Nano) to 833 MB (Kokoro EN) — which directly impacts whether a model can coexist with other AI workloads on memory-constrained edge devices.
References
Our Repositories:
- android-offline-tts-eval — Android TTS benchmark app (Apache 2.0)
- ios-offline-tts-eval — iOS TTS benchmark app (Apache 2.0)
Models:
- Kokoro — StyleTTS2-based, high-quality multilingual TTS
- Piper — Fast VITS-based TTS with many voices
- Matcha-TTS — Flow-matching TTS with Vocos vocoder
- Kitten Nano/Mini — Lightweight neural TTS
- MMS-TTS — Meta’s Massively Multilingual Speech TTS (1,100+ languages)
Inference Engine:
- sherpa-onnx — Next-gen Kaldi ONNX Runtime (supports TTS models)