
Offline Text-to-Speech Benchmark: 18 Models Across Android and iOS

Akinori Nakajima - VoicePing

Comprehensive benchmark of 18 on-device text-to-speech models including Kokoro, Piper, Matcha, Kitten, and VITS on Android and iOS

Source Code:

Abstract

We benchmark 18 on-device text-to-speech models across 2 inference engines on Android (7 models) and iOS (11 models), measuring synthesis speed (tok/s), real-time factor (RTF), and memory usage. All benchmarks use English text prompts only. Results show Android System TTS and Piper VITS achieve the fastest synthesis on Android (33–42 tok/s), while Kokoro runs slower than real-time on Android (RTF 1.13–2.42) and its Int8 multi-lang variants do so on iOS as well (RTF 1.57–1.82); only Kokoro EN v0.19 on iOS stays under real-time (RTF 0.62). On iOS, Apple’s built-in AVSpeechSynthesizer scores highest overall due to minimal memory overhead, and Matcha + Vocos provides the best balance of speed and resource efficiency among open-source models. No formal listening test (MOS/ABX) was conducted: this benchmark measures speed and resource usage, not perceptual voice quality.

Motivation

Developers integrating TTS into mobile and edge applications face a three-way trade-off: synthesis latency (can the model keep up with real-time interaction?), memory footprint (can it coexist with ASR and other AI models on a memory-constrained device?), and voice quality (is the output acceptable for the use case?). In this benchmark, median synthesis time ranges from 440 ms (Android System TTS) to about 15 seconds (Kokoro Int8), a 35x difference, and memory on iOS ranges from 21 MB to 833 MB.

Existing TTS comparisons typically evaluate quality (MOS scores) on server hardware, but for on-device deployment the binding constraints are speed and memory, not quality alone. A model that sounds excellent but takes 15 seconds to synthesize a sentence is unusable for interactive applications. This benchmark provides the speed and memory data developers need to make the UX trade-off decision: which model delivers acceptable latency on real mobile hardware?

Methodology

Both platforms use a standardized set of 12 English text prompts of varying length and complexity. All results in this benchmark reflect English synthesis performance only — multilingual models were not evaluated in other languages. Each model is evaluated in warm mode (1 warm-up iteration) to measure steady-state performance.
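The warm-mode procedure described above can be sketched as follows. This is a minimal illustration, not the benchmark's actual harness: `benchmark_warm` and the `synthesize` callable are hypothetical names, and timing each prompt once after a single untimed warm-up run is an assumption about how "warm mode" is applied.

```python
import statistics
import time
from typing import Callable, Sequence


def benchmark_warm(
    synthesize: Callable[[str], object],  # hypothetical: wraps any TTS engine call
    prompts: Sequence[str],
    warmup_iterations: int = 1,
) -> float:
    """Return the median synthesis time in milliseconds across all prompts."""
    # Warm-up: run untimed so model loading and cache effects do not skew results.
    for _ in range(warmup_iterations):
        synthesize(prompts[0])

    timings_ms = []
    for prompt in prompts:
        start = time.perf_counter()
        synthesize(prompt)
        timings_ms.append((time.perf_counter() - start) * 1000.0)

    # The median is less sensitive to one slow outlier prompt than the mean.
    return statistics.median(timings_ms)
```

Passing the engine call as a plain callable keeps the timing loop identical across engines, so only the wrapped call differs between a system TTS and a sherpa-onnx model.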

Metrics:

  • tok/s: Output tokens (words) synthesized per second (higher = faster)
  • RTF: Real-Time Factor — ratio of synthesis time to audio duration (below 1.0 = faster than real-time)
  • Overall Score: Composite metric (iOS only, 0–100 scale) = weighted combination of Speed Score (tok/s normalized + RTF penalty) and Memory Score (inverse of memory usage). Models with RTF > 1.5 receive a Speed Score of 0. Full formula in ios-offline-tts-eval source. This score does not include voice quality — no formal listening test (MOS/ABX) was conducted.
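As a rough illustration of the first two metrics, the sketch below computes tok/s and RTF per prompt and then takes the median. The function names and the measurements in `runs` are hypothetical, and counting tokens as whitespace-separated words is an assumption based on the "tokens (words)" definition above. The Overall Score is not sketched because its formula is only given in the source repository.

```python
import statistics


def tokens_per_second(text: str, synth_seconds: float) -> float:
    """Words synthesized per second; tokens are whitespace-separated words."""
    return len(text.split()) / synth_seconds


def real_time_factor(synth_seconds: float, audio_seconds: float) -> float:
    """Synthesis time divided by audio duration; below 1.0 is faster than real-time."""
    return synth_seconds / audio_seconds


# Hypothetical measurements: (prompt, synthesis time in s, audio duration in s).
runs = [
    ("The quick brown fox jumps over the lazy dog", 0.30, 3.0),
    ("Hello world", 0.10, 0.8),
    ("On-device speech synthesis should keep up with the user", 0.45, 3.6),
]

median_tok_s = statistics.median(tokens_per_second(t, s) for t, s, _ in runs)
median_rtf = statistics.median(real_time_factor(s, a) for _, s, a in runs)
print(f"median tok/s: {median_tok_s:.2f}, median RTF: {median_rtf:.3f}")
```

Because each metric is a median over per-prompt values, the reported median tok/s and median RTF for a model need not come from the same prompt.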

Devices:

| Device | Chip | RAM | OS |
|---|---|---|---|
| Samsung Galaxy S10 | Exynos 9820 | 8 GB | Android 12 (API 31) |
| iPad Pro 3rd gen | A12X Bionic | 4 GB | iPadOS 17+ |

Android Results

Device: Samsung Galaxy S10, Android 12, API 31, 4 threads

Android TTS Inference Speed — Tokens per Second

| Model | Engine | Median Synth (ms) | Median tok/s | Median RTF | Status |
|---|---|---|---|---|---|
| Android System TTS | android_system_tts | 440 | 42.48 | 0.058 | PASS |
| Piper (ryan-low) | sherpa-onnx | 478 | 39.14 | 0.077 | PASS |
| Piper (amy-low) | sherpa-onnx | 524 | 33.39 | 0.076 | PASS |
| Matcha-Icefall (LJSpeech + HiFiGAN) | sherpa-onnx | 1,104 | 16.37 | 0.135 | PASS |
| Kitten Nano (en v0.2 fp16) | sherpa-onnx | 3,526 | 5.18 | 0.387 | PASS |
| Kokoro (en v0.19) | sherpa-onnx | 8,226 | 2.37 | 1.133 | PASS |
| Kokoro Int8 (multi-lang v1.1) | sherpa-onnx | 15,343 | 1.25 | 2.423 | PASS |

All 7 models PASS: no crashes or OOM conditions occurred on the 8 GB device.

Android Speed Observations

Android System TTS and Piper VITS models are the fastest (33–42 tok/s). Kokoro models run slower than real-time (RTF > 1.0) but are designed for higher voice quality. Matcha-Icefall offers a middle ground at 16 tok/s. Note: this benchmark measures speed and resource usage only — no formal listening test (MOS) was conducted, so quality comparisons are based on the models’ published characteristics.
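To make these throughput figures concrete, the sketch below converts the median tok/s values from the Android table into an estimated wait for a single utterance. The 20-word utterance length is a hypothetical choice, and the estimate assumes throughput stays roughly constant with sentence length, which the benchmark does not establish.

```python
# Median tok/s from the Android results table (Samsung Galaxy S10).
ANDROID_MEDIAN_TOK_S = {
    "Android System TTS": 42.48,
    "Piper (ryan-low)": 39.14,
    "Piper (amy-low)": 33.39,
    "Matcha-Icefall (LJSpeech + HiFiGAN)": 16.37,
    "Kitten Nano (en v0.2 fp16)": 5.18,
    "Kokoro (en v0.19)": 2.37,
    "Kokoro Int8 (multi-lang v1.1)": 1.25,
}


def estimated_synth_seconds(word_count: int, tok_per_s: float) -> float:
    """Rough synthesis time, assuming throughput is constant with length."""
    return word_count / tok_per_s


SENTENCE_WORDS = 20  # hypothetical utterance length

estimates = {
    model: estimated_synth_seconds(SENTENCE_WORDS, tok_s)
    for model, tok_s in ANDROID_MEDIAN_TOK_S.items()
}
for model, seconds in estimates.items():
    print(f"{model}: ~{seconds:.1f} s")
```

Under these assumptions a 20-word utterance takes about half a second with Android System TTS and 16 seconds with Kokoro Int8, which is the gap that matters for interactive use.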

[Figure: Android TTS Real-Time Factor]

iOS Results

Device: iPad Pro 3rd gen, A12X Bionic, 4 GB RAM

iOS TTS Overall Score

| Model | Engine | Overall Score | Speed Score | Median tok/s | Median RTF | Memory (MB) |
|---|---|---|---|---|---|---|
| AVSpeech (System) | native | 100.00 | | 151.34 | | 21 |
| Matcha (LJSpeech) + Vocos | sherpa-onnx | 87.77 | 94.39 | 25.68 | 0.084 | 211 |
| Kitten Nano EN (v0.2 fp16) | sherpa-onnx | 59.72 | 75.45 | 5.14 | 0.368 | 193 |
| Kitten Nano (en v0.1 fp16) | sherpa-onnx | 58.90 | 72.86 | 5.61 | 0.407 | 108 |
| Kokoro EN (v0.19) | sherpa-onnx | 43.59 | 58.60 | 4.01 | 0.621 | 833 |
| Kitten Mini EN (v0.1 fp16) | sherpa-onnx | 24.57 | 24.30 | 1.63 | 1.135 | 427 |
| VITS LJS (Int8) | sherpa-onnx | 21.41 | 0.00 | 1.20 | 2.023 | 140 |
| VITS VCTK (Int8) | sherpa-onnx | 20.98 | 0.00 | 1.43 | 2.062 | 122 |
| VITS Melo (ZH+EN, Int8) | sherpa-onnx | 20.07 | 0.00 | 0.83 | 2.874 | 211 |
| Kokoro Int8 (Multi-lang v1.0) | sherpa-onnx | 17.06 | 0.00 | 1.40 | 1.822 | 515 |
| Kokoro Multi-lang INT8 (v1.1) | sherpa-onnx | 16.91 | 0.00 | 1.71 | 1.569 | 588 |

iOS Observations

  • AVSpeech scores highest overall due to negligible memory usage (21 MB vs 100–800 MB for open-source models) and fast synthesis, though voice quality is limited to Apple’s built-in voices.
  • Matcha + Vocos is the best open-source option on iOS — fast (RTF 0.08), high overall score (87.8), with moderate memory at 211 MB.
  • Kitten Nano models offer a good balance — RTF below 0.5 with reasonable memory (108–193 MB).
  • Kokoro EN (v0.19) scores 43.6 overall with RTF 0.62 — faster than real-time but memory-heavy at 833 MB, the largest footprint in this benchmark.
  • VITS and Kokoro Int8 variants, along with Kitten Mini, all run slower than real-time on iPad (RTF > 1.0), making them impractical for interactive use.
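One way to apply these observations is to filter the iOS results by a latency ceiling and a memory budget. The sketch below uses the median RTF and memory values reported for the sherpa-onnx models; the `candidates` helper and the 256 MB budget are hypothetical and only illustrate the selection step.

```python
from typing import List, NamedTuple


class IosResult(NamedTuple):
    model: str
    rtf: float
    memory_mb: int


# Median RTF and memory from the iOS results (sherpa-onnx models only).
IOS_RESULTS = [
    IosResult("Matcha (LJSpeech) + Vocos", 0.084, 211),
    IosResult("Kitten Nano EN (v0.2 fp16)", 0.368, 193),
    IosResult("Kitten Nano (en v0.1 fp16)", 0.407, 108),
    IosResult("Kokoro EN (v0.19)", 0.621, 833),
    IosResult("Kitten Mini EN (v0.1 fp16)", 1.135, 427),
    IosResult("VITS LJS (Int8)", 2.023, 140),
    IosResult("VITS VCTK (Int8)", 2.062, 122),
    IosResult("VITS Melo (ZH+EN, Int8)", 2.874, 211),
    IosResult("Kokoro Int8 (Multi-lang v1.0)", 1.822, 515),
    IosResult("Kokoro Multi-lang INT8 (v1.1)", 1.569, 588),
]


def candidates(
    results: List[IosResult], max_rtf: float, memory_budget_mb: int
) -> List[IosResult]:
    """Models below max_rtf that fit the memory budget, fastest first."""
    fits = [
        r for r in results
        if r.rtf < max_rtf and r.memory_mb <= memory_budget_mb
    ]
    return sorted(fits, key=lambda r: r.rtf)


# Hypothetical constraints: faster than real-time, at most 256 MB for TTS.
shortlist = candidates(IOS_RESULTS, max_rtf=1.0, memory_budget_mb=256)
for r in shortlist:
    print(f"{r.model}: RTF {r.rtf}, {r.memory_mb} MB")
```

With these example constraints the shortlist contains Matcha + Vocos and the two Kitten Nano models. Kokoro EN (v0.19) is fast enough but exceeds the budget.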

Limitations

  • No voice quality evaluation: This benchmark does not include perceptual quality metrics (MOS, ABX, or listening tests). Quality comparisons reference the models’ published characteristics only.
  • English only: All prompts are in English. Multilingual models (Kokoro multi-lang, VITS Melo ZH+EN) were not evaluated in their other supported languages.
  • Single device per platform: Results are from one Android and one iOS device. Performance may vary on other chipsets.
  • Overall Score excludes quality: The iOS composite score reflects speed and memory efficiency only — a high-scoring model is not necessarily the best-sounding one.

Further Research

  • Perceptual quality study: Run MOS/ABX listening tests with human raters to validate quality claims beyond speed/memory metrics.
  • Multilingual prompt suite: Evaluate non-English synthesis quality and speed for multilingual models, not only English prompts.
  • Prosody and style control: Benchmark controllability (emotion, speaking rate, punctuation sensitivity) for dialogue and assistant use cases.
  • Streaming TTS latency: Measure time-to-first-audio and chunk-level latency for interactive assistants.
  • Compression and mobile scaling: Compare INT8/INT4 and voice-cloning variants under strict mobile RAM budgets.

Conclusion

On-device TTS is viable on both Android and iOS, but model choice depends heavily on the use case. For real-time interactive applications, Android System TTS or Piper (Android) and Matcha + Vocos (iOS) provide fast synthesis (RTF well below 1.0). For pre-generated audio or non-interactive use where latency is acceptable, Kokoro is designed for richer voice output at the cost of higher synthesis time and memory, although that quality advantage was not measured here. System TTS engines remain competitive: Android’s built-in TTS is the fastest option on Android (42 tok/s), while Apple’s AVSpeechSynthesizer scores highest on iOS due to its minimal resource footprint (21 MB).

On iOS, memory consumption varies widely among the sherpa-onnx models, from 108 MB (Kitten Nano v0.1) to 833 MB (Kokoro EN v0.19), which directly affects whether a model can coexist with other AI workloads on memory-constrained edge devices.

References

Our Repositories:

Models:

  • Kokoro — StyleTTS2-based, high-quality multilingual TTS
  • Piper — Fast VITS-based TTS with many voices
  • Matcha-TTS — Flow-matching TTS with Vocos vocoder
  • Kitten Nano/Mini — Lightweight neural TTS
  • MMS-TTS — Meta’s Massively Multilingual Speech TTS (1,100+ languages)

Inference Engine:

  • sherpa-onnx — Next-gen Kaldi ONNX Runtime (supports TTS models)