Offline Text-to-Speech Benchmark: 18 Models Across Android and iOS

Source Code:

android-offline-tts-eval — Android TTS benchmark app with 7 models via sherpa-onnx and system TTS
ios-offline-tts-eval — iOS TTS benchmark app with 11 models including AVSpeech

Abstract

We benchmark 18 on-device text-to-speech models across 2 inference engines on Android (7 models) and iOS (11 models), measuring synthesis speed (tok/s), real-time factor (RTF), and memory usage (reported for iOS only). All benchmarks use English text prompts only. Results show Android System TTS and Piper VITS achieve the fastest synthesis on Android (33–42 tok/s). Kokoro runs slower than real-time on Android in both tested variants and on iOS in its Int8 multilingual variants; only Kokoro EN (v0.19) on iOS stays under real-time (RTF 0.62). On iOS, Apple’s built-in AVSpeechSynthesizer scores highest overall due to minimal memory overhead, but Matcha + Vocos provides the best balance of speed and resource efficiency among open-source models. No formal listening test (MOS/ABX) was conducted: this benchmark measures speed and resource usage, not perceptual voice quality.

Motivation

Developers integrating TTS into mobile and edge applications face a three-way trade-off: synthesis latency (can the model keep up with real-time interaction?), memory footprint (can it coexist with ASR and other AI models on a memory-constrained device?), and voice quality (is the output acceptable for the use case?). In this benchmark, median synthesis time ranges from 440 ms (Android System TTS) to about 15 seconds (Kokoro Int8 on Android), a roughly 35x difference, and memory on iOS ranges from 21 MB to 833 MB.

Existing TTS comparisons typically evaluate quality (MOS scores) on server hardware, but for on-device deployment the binding constraints are speed and memory, not quality alone. A model that sounds excellent but takes 15 seconds to synthesize a sentence is unusable for interactive applications. This benchmark provides the speed and memory data developers need to make the UX trade-off decision: which model delivers acceptable latency on real mobile hardware?

Methodology

Both platforms use a standardized set of 12 English text prompts of varying length and complexity. All results in this benchmark reflect English synthesis performance only — multilingual models were not evaluated in other languages. Each model is evaluated in warm mode (1 warm-up iteration) to measure steady-state performance.
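The warm-mode procedure described above can be sketched as a small timing loop. This is a minimal illustration rather than the benchmark apps' actual harness: `synthesize` is a hypothetical callable standing in for whichever engine is under test, and running the warm-up on the first prompt is an assumption.

```python
import statistics
import time
from typing import Callable, Sequence


def benchmark_warm(
    synthesize: Callable[[str], object],
    prompts: Sequence[str],
    warmup_iterations: int = 1,
) -> float:
    """Return the median synthesis time in milliseconds across all prompts."""
    if not prompts:
        raise ValueError("at least one prompt is required")

    # Warm-up runs are executed but not timed, so one-off costs such as
    # model loading do not skew the steady-state measurements.
    # Assumption: the first prompt is reused for the warm-up.
    for _ in range(warmup_iterations):
        synthesize(prompts[0])

    timings_ms = []
    for prompt in prompts:
        start = time.perf_counter()
        synthesize(prompt)
        timings_ms.append((time.perf_counter() - start) * 1000.0)

    return statistics.median(timings_ms)
```

Reporting the median rather than the mean keeps a single unusually slow prompt from dominating the result, which matches the "Median" columns in the tables.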

Metrics:

tok/s: Output tokens (words) synthesized per second (higher = faster)
RTF: Real-Time Factor — ratio of synthesis time to audio duration (below 1.0 = faster than real-time)
Overall Score: Composite metric (iOS only, 0–100 scale) = weighted combination of Speed Score (tok/s normalized + RTF penalty) and Memory Score (inverse of memory usage). Models with RTF > 1.5 receive a Speed Score of 0. Full formula in ios-offline-tts-eval source. This score does not include voice quality — no formal listening test (MOS/ABX) was conducted.
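As a rough sketch, the two per-run metrics and the Speed Score cut-off stated above can be computed as follows. Counting whitespace-separated words as tokens is an assumption based on the "tokens (words)" wording, and the function names are illustrative. The full Overall Score formula is not reproduced here because it is only given in the ios-offline-tts-eval source.

```python
def words_per_second(text: str, synth_seconds: float) -> float:
    """tok/s, treating whitespace-separated words as tokens (assumption)."""
    if synth_seconds <= 0:
        raise ValueError("synth_seconds must be positive")
    return len(text.split()) / synth_seconds


def real_time_factor(synth_seconds: float, audio_seconds: float) -> float:
    """RTF = synthesis time / duration of the generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synth_seconds / audio_seconds


def speed_score_zeroed(rtf: float) -> bool:
    """True when the stated rule (RTF > 1.5) forces a Speed Score of 0."""
    return rtf > 1.5
```

For example, a four-word prompt synthesized in 0.5 s into 4.0 s of audio gives 8 tok/s and an RTF of 0.125, comfortably faster than real-time.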

Devices:

Device	Chip	RAM	OS
Samsung Galaxy S10	Exynos 9820	8 GB	Android 12 (API 31)
iPad Pro 3rd gen	A12X Bionic	4 GB	iPadOS 17+

Android Results

Device: Samsung Galaxy S10, Android 12, API 31, 4 threads

Android TTS Inference Speed — Tokens per Second

Model	Engine	Median Synth (ms)	Median tok/s	Median RTF	Status
Android System TTS	android_system_tts	440	42.48	0.058	PASS
Piper (ryan-low)	sherpa-onnx	478	39.14	0.077	PASS
Piper (amy-low)	sherpa-onnx	524	33.39	0.076	PASS
Matcha-Icefall (LJSpeech + HiFiGAN)	sherpa-onnx	1,104	16.37	0.135	PASS
Kitten Nano (en v0.2 fp16)	sherpa-onnx	3,526	5.18	0.387	PASS
Kokoro (en v0.19)	sherpa-onnx	8,226	2.37	1.133	PASS
Kokoro Int8 (multi-lang v1.1)	sherpa-onnx	15,343	1.25	2.423	PASS

All 7 models PASS: no crashes or OOM conditions on the 8 GB device.

Android Speed Observations

Android System TTS and Piper VITS models are the fastest (33–42 tok/s). Kokoro models run slower than real-time (RTF > 1.0) but are designed for higher voice quality. Matcha-Icefall offers a middle ground at 16 tok/s. Note: this benchmark measures speed and resource usage only — no formal listening test (MOS) was conducted, so quality comparisons are based on the models’ published characteristics.
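The observations above can be reproduced from the table with a few lines of code. The sketch below copies the Android rows verbatim; the helper names are illustrative and not part of the benchmark app. Note that 1 / RTF is read here as seconds of audio produced per second of compute.

```python
# Rows copied from the Android results table:
# (model, median synth ms, median tok/s, median RTF)
ANDROID_RESULTS = [
    ("Android System TTS", 440, 42.48, 0.058),
    ("Piper (ryan-low)", 478, 39.14, 0.077),
    ("Piper (amy-low)", 524, 33.39, 0.076),
    ("Matcha-Icefall (LJSpeech + HiFiGAN)", 1104, 16.37, 0.135),
    ("Kitten Nano (en v0.2 fp16)", 3526, 5.18, 0.387),
    ("Kokoro (en v0.19)", 8226, 2.37, 1.133),
    ("Kokoro Int8 (multi-lang v1.1)", 15343, 1.25, 2.423),
]


def slower_than_real_time(results):
    """Names of models whose median RTF exceeds 1.0."""
    return [name for name, _ms, _tps, rtf in results if rtf > 1.0]


def audio_seconds_per_compute_second(rtf):
    """Inverse of RTF: how much audio one second of synthesis yields."""
    return 1.0 / rtf


def synth_time_spread(results):
    """Ratio of the slowest to the fastest median synthesis time."""
    times = [ms for _name, ms, _tps, _rtf in results]
    return max(times) / min(times)


if __name__ == "__main__":
    print("Slower than real-time:", slower_than_real_time(ANDROID_RESULTS))
    print("Slowest/fastest synth time:", round(synth_time_spread(ANDROID_RESULTS), 1))
```

Only the two Kokoro variants land in the slower-than-real-time group, and the spread between the slowest and fastest median synthesis time comes out just under 35x.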

[Figure: Android TTS Real-Time Factor]

iOS Results

Device: iPad Pro 3rd gen, A12X Bionic, 4 GB RAM

iOS TTS Overall Score

Model	Engine	Overall Score	Speed Score	Median tok/s	Median RTF	Memory (MB)
AVSpeech (System)	native	100.00	—	151.34	—	21
Matcha (LJSpeech) + Vocos	sherpa-onnx	87.77	94.39	25.68	0.084	211
Kitten Nano EN (v0.2 fp16)	sherpa-onnx	59.72	75.45	5.14	0.368	193
Kitten Nano (en v0.1 fp16)	sherpa-onnx	58.90	72.86	5.61	0.407	108
Kokoro EN (v0.19)	sherpa-onnx	43.59	58.60	4.01	0.621	833
Kitten Mini EN (v0.1 fp16)	sherpa-onnx	24.57	24.30	1.63	1.135	427
VITS LJS (Int8)	sherpa-onnx	21.41	0.00	1.20	2.023	140
VITS VCTK (Int8)	sherpa-onnx	20.98	0.00	1.43	2.062	122
VITS Melo (ZH+EN, Int8)	sherpa-onnx	20.07	0.00	0.83	2.874	211
Kokoro Int8 (Multi-lang v1.0)	sherpa-onnx	17.06	0.00	1.40	1.822	515
Kokoro Multi-lang INT8 (v1.1)	sherpa-onnx	16.91	0.00	1.71	1.569	588

iOS Observations

AVSpeech scores highest overall due to low memory usage (21 MB vs 108–833 MB for the open-source models) and fast synthesis, though its output is limited to Apple’s built-in voices.
Matcha + Vocos is the best open-source option on iOS — fast (RTF 0.08), high overall score (87.8), with moderate memory at 211 MB.
Kitten Nano models offer a good balance — RTF below 0.5 with reasonable memory (108–193 MB).
Kokoro EN (v0.19) scores 43.6 overall with RTF 0.62 — faster than real-time but memory-heavy at 833 MB, the largest footprint in this benchmark.
The Int8 VITS and Kokoro variants all run slower than real-time on the iPad (RTF 1.57–2.87), as does Kitten Mini (RTF 1.14), making them impractical for interactive use.
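One way to apply these observations is to filter the iOS table by a memory budget and an RTF ceiling. The sketch below is illustrative only: the 250 MB budget in the usage note is a hypothetical figure, and AVSpeech is skipped because the table reports no RTF for it.

```python
# Rows copied from the iOS results table: (model, median RTF, memory MB).
# RTF is None where the table shows no value.
IOS_RESULTS = [
    ("AVSpeech (System)", None, 21),
    ("Matcha (LJSpeech) + Vocos", 0.084, 211),
    ("Kitten Nano EN (v0.2 fp16)", 0.368, 193),
    ("Kitten Nano (en v0.1 fp16)", 0.407, 108),
    ("Kokoro EN (v0.19)", 0.621, 833),
    ("Kitten Mini EN (v0.1 fp16)", 1.135, 427),
    ("VITS LJS (Int8)", 2.023, 140),
    ("VITS VCTK (Int8)", 2.062, 122),
    ("VITS Melo (ZH+EN, Int8)", 2.874, 211),
    ("Kokoro Int8 (Multi-lang v1.0)", 1.822, 515),
    ("Kokoro Multi-lang INT8 (v1.1)", 1.569, 588),
]


def candidates(results, max_memory_mb, max_rtf=1.0):
    """Models with a reported RTF <= max_rtf that fit the budget, fastest first."""
    fits = [
        (name, rtf, mem)
        for name, rtf, mem in results
        if rtf is not None and rtf <= max_rtf and mem <= max_memory_mb
    ]
    return sorted(fits, key=lambda row: row[1])


if __name__ == "__main__":
    for name, rtf, mem in candidates(IOS_RESULTS, max_memory_mb=250):
        print(f"{name}: RTF {rtf}, {mem} MB")
```

With a 250 MB budget this leaves Matcha + Vocos and the two Kitten Nano models; Kokoro EN is fast enough but exceeds the budget at 833 MB.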

Limitations

No voice quality evaluation: This benchmark does not include perceptual quality metrics (MOS, ABX, or listening tests). Quality comparisons reference the models’ published characteristics only.
English only: All prompts are in English. Multilingual models (Kokoro multi-lang, VITS Melo ZH+EN) were not evaluated in their other supported languages.
Single device per platform: Results are from one Android and one iOS device. Performance may vary on other chipsets.
Overall Score excludes quality: The iOS composite score reflects speed and memory efficiency only — a high-scoring model is not necessarily the best-sounding one.

Further Research

Perceptual quality study: Run MOS/ABX listening tests with human raters to validate quality claims beyond speed/memory metrics.
Multilingual prompt suite: Evaluate non-English synthesis quality and speed for multilingual models, not only English prompts.
Prosody and style control: Benchmark controllability (emotion, speaking rate, punctuation sensitivity) for dialogue and assistant use cases.
Streaming TTS latency: Measure time-to-first-audio and chunk-level latency for interactive assistants.
Compression and mobile scaling: Compare INT8/INT4 and voice-cloning variants under strict mobile RAM budgets.

Conclusion

On-device TTS is viable on both Android and iOS, but model choice depends heavily on the use case. For real-time interactive applications, Android System TTS or Piper (Android) and Matcha + Vocos (iOS) provide fast synthesis (RTF well below 1.0). For pre-generated audio or non-interactive use where latency is acceptable, Kokoro is designed for higher voice quality (not measured in this benchmark) at the cost of higher synthesis time and memory. System TTS engines remain competitive: Android’s built-in TTS is the fastest option on Android (42 tok/s), while Apple’s AVSpeechSynthesizer scores highest on iOS due to its minimal resource footprint (21 MB).

On iOS, memory consumption varies widely among the open-source models, from 108 MB (Kitten Nano) to 833 MB (Kokoro EN), compared with 21 MB for AVSpeech. This directly affects whether a model can coexist with other AI workloads on memory-constrained edge devices.

References

Our Repositories:

android-offline-tts-eval — Android TTS benchmark app (Apache 2.0)
ios-offline-tts-eval — iOS TTS benchmark app (Apache 2.0)

Models:

Kokoro — StyleTTS2-based, high-quality multilingual TTS
Piper — Fast VITS-based TTS with many voices
Matcha-TTS — Flow-matching TTS with Vocos vocoder
Kitten Nano/Mini — Lightweight neural TTS
MMS-TTS — Meta’s Massively Multilingual Speech TTS (1,100+ languages)

Inference Engine:

sherpa-onnx — Next-gen Kaldi ONNX Runtime (supports TTS models)