
Offline Speech Transcription Benchmark: 16 Models Across Android, iOS, macOS, and Windows

Akinori Nakajima - VoicePing

Comprehensive benchmark of 16 on-device speech-to-text models across 9 inference engines on Android, iOS, macOS, and Windows

Source Code:

Abstract

We benchmark 16 on-device speech-to-text models across 9 inference engines on Android, iOS, macOS, and Windows, measuring inference speed (tok/s), real-time factor (RTF), and memory footprint. This benchmark measures speed only — transcription accuracy (WER/CER) is not evaluated. Key findings: the choice of inference engine can change performance by 51x for the same model (sherpa-onnx vs whisper.cpp on Android); Moonshine Tiny and SenseVoice Small achieve the fastest inference across platforms; and WhisperKit CoreML crashes on 4 GB iOS devices for models above Whisper Tiny. All benchmark apps and results are open-source.

Motivation

Developers building voice-enabled edge applications face a combinatorial selection problem: dozens of ASR models (from 31 MB Whisper Tiny to 1.8 GB Qwen3 ASR), multiple inference engines (ONNX Runtime, CoreML, whisper.cpp, MLX), and 4+ target platforms — each combination producing vastly different speed and memory characteristics. Published model benchmarks typically report results on server GPUs, not the consumer mobile and laptop hardware where these models actually deploy.

This benchmark addresses the deployment choice problem directly: which model + engine combination delivers real-time transcription on each target platform, and what are the memory constraints? The results enable developers to select a model/engine pair based on measured data from their target device class, rather than extrapolating from GPU benchmarks.

Methodology

Android, iOS, and macOS benchmarks use the same 30-second WAV file containing a looped segment of the JFK inaugural address (16 kHz, mono, PCM 16-bit). Windows benchmarks use an 11-second excerpt from the same source (see Windows section for details).
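For reproducibility, the expected input format can be verified with Python's standard-library wave module. This is a generic sketch; the helper name and file path are illustrative, not code from the benchmark apps:

```python
import wave

def check_asr_input(path: str) -> float:
    """Verify a WAV file matches the benchmark input format
    (16 kHz, mono, 16-bit PCM) and return its duration in seconds."""
    with wave.open(path, "rb") as wf:
        assert wf.getframerate() == 16000, "expected 16 kHz sample rate"
        assert wf.getnchannels() == 1, "expected mono audio"
        assert wf.getsampwidth() == 2, "expected 16-bit PCM (2 bytes/sample)"
        return wf.getnframes() / wf.getframerate()

# Usage: check_asr_input("jfk_30s.wav") should return ~30.0
```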

Metrics (speed-only — this benchmark does not measure transcription accuracy/WER):

  • Inference: Wall-clock time from the engine call to the final result
  • tok/s: Output words per second (higher = faster); despite the name, the count is whitespace-separated output words
  • RTF: Real-Time Factor, the ratio of processing time to audio duration (below 1.0 = faster than real-time)
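As a sketch of how these metrics relate, the helper below computes both from raw timings (illustrative code, not from the benchmark apps; the 58-word example count is hypothetical):

```python
def speed_metrics(transcript: str, inference_ms: float, audio_s: float):
    """Compute tok/s and RTF as used in the result tables:
    tok/s counts whitespace-separated output words per second of
    wall-clock inference; RTF divides processing time by audio length."""
    seconds = inference_ms / 1000.0
    tok_per_s = len(transcript.split()) / seconds
    rtf = seconds / audio_s
    return tok_per_s, rtf

# Example: 58 output words from a 30 s clip in 2,068 ms of inference
tps, rtf = speed_metrics(" ".join(["word"] * 58), 2068, 30.0)
print(f"{tps:.2f} tok/s, RTF {rtf:.2f}")  # 28.05 tok/s, RTF 0.07
```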

Devices:

| Device | Chip | RAM | OS |
|---|---|---|---|
| Samsung Galaxy S10 | Exynos 9820 | 8 GB | Android 12 (API 31) |
| iPad Pro 3rd gen | A12X Bionic | 4 GB | iOS 17+ |
| MacBook Air | Apple M4 | 32 GB | macOS 15+ |
| Laptop | Intel Core i5-1035G1 | 8 GB | Windows (CPU-only) |

Android Results

Device: Samsung Galaxy S10, Android 12, API 31

Android Inference Speed — Tokens per Second

| Model | Engine | Params | Size | Languages | Inference | tok/s | RTF | Result |
|---|---|---|---|---|---|---|---|---|
| Moonshine Tiny | sherpa-onnx | 27M | ~125 MB | English | 1,363 ms | 42.55 | 0.05 | ✅ PASS |
| SenseVoice Small | sherpa-onnx | 234M | ~240 MB | zh/en/ja/ko/yue | 1,725 ms | 33.62 | 0.06 | ✅ PASS |
| Whisper Tiny | sherpa-onnx | 39M | ~100 MB | 99 languages | 2,068 ms | 27.08 | 0.07 | ✅ PASS |
| Moonshine Base | sherpa-onnx | 61M | ~290 MB | English | 2,251 ms | 25.77 | 0.08 | ✅ PASS |
| Parakeet TDT 0.6B v3 | sherpa-onnx | 600M | ~671 MB | 25 European | 2,841 ms | 20.41 | 0.09 | ✅ PASS |
| Android Speech (Offline) | SpeechRecognizer | System | Built-in | 50+ languages | 3,615 ms | 1.38 | 0.12 | ✅ PASS |
| Android Speech (Online) | SpeechRecognizer | System | Built-in | 100+ languages | 3,591 ms | 1.39 | 0.12 | ✅ PASS |
| Zipformer Streaming | sherpa-onnx streaming | 20M | ~73 MB | English | 3,568 ms | 16.26 | 0.12 | ✅ PASS |
| Whisper Base (.en) | sherpa-onnx | 74M | ~160 MB | English | 3,917 ms | 14.81 | 0.13 | ✅ PASS |
| Whisper Base | sherpa-onnx | 74M | ~160 MB | 99 languages | 4,038 ms | 14.36 | 0.13 | ✅ PASS |
| Whisper Small | sherpa-onnx | 244M | ~490 MB | 99 languages | 12,329 ms | 4.70 | 0.41 | ✅ PASS |
| Qwen3 ASR 0.6B (ONNX) | ONNX Runtime INT8 | 600M | ~1.9 GB | 30 languages | 15,881 ms | 3.65 | 0.53 | ✅ PASS |
| Whisper Turbo | sherpa-onnx | 809M | ~1.0 GB | 99 languages | 17,930 ms | 3.23 | 0.60 | ✅ PASS |
| Whisper Tiny (whisper.cpp) | whisper.cpp GGML | 39M | ~31 MB | 99 languages | 105,596 ms | 0.55 | 3.52 | ✅ PASS |
| Qwen3 ASR 0.6B (CPU) | Pure C/NEON | 600M | ~1.8 GB | 30 languages | 338,261 ms | 0.17 | 11.28 | ✅ PASS |
| Omnilingual 300M | sherpa-onnx | 300M | ~365 MB | 1,600+ languages | 44,035 ms | 0.05 | 1.47 | ❌ FAIL |

15/16 PASS — 0 OOM conditions. The sole failure (Omnilingual 300M) is a known model limitation with English language detection.

Android Engine Comparison: Same Model, Different Backends

The Whisper Tiny model shows dramatic performance differences depending on the inference backend:

| Backend | Inference | tok/s | Speedup |
|---|---|---|---|
| sherpa-onnx (ONNX) | 2,068 ms | 27.08 | 51x |
| whisper.cpp (GGML) | 105,596 ms | 0.55 | 1x (baseline) |

sherpa-onnx is 51x faster than whisper.cpp for the same Whisper Tiny model on Android — a critical finding for developers choosing an inference runtime.
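The speedup figure is simply the ratio of the two wall-clock inference times from the table above:

```python
# Whisper Tiny on Android: same model, two backends (values from the table)
sherpa_onnx_ms = 2068     # sherpa-onnx (ONNX)
whisper_cpp_ms = 105596   # whisper.cpp (GGML)

speedup = whisper_cpp_ms / sherpa_onnx_ms
print(f"{speedup:.0f}x")  # 51x
```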

iOS Results

Device: iPad Pro 3rd gen, A12X Bionic, 4 GB RAM

iOS Inference Speed — Tokens per Second

| Model | Engine | Params | Size | Languages | tok/s | Status |
|---|---|---|---|---|---|---|
| Parakeet TDT v3 | FluidAudio (CoreML) | 600M | ~600 MB (CoreML) | 25 European | 181.8 | ✅ |
| Zipformer 20M | sherpa-onnx streaming | 20M | ~46 MB (INT8) | English | 39.7 | ✅ |
| Whisper Tiny | whisper.cpp | 39M | ~31 MB (GGML Q5_1) | 99 languages | 37.8 | ✅ |
| Moonshine Tiny | sherpa-onnx offline | 27M | ~125 MB (INT8) | English | 37.3 | ✅ |
| Moonshine Base | sherpa-onnx offline | 61M | ~280 MB (INT8) | English | 31.3 | ✅ |
| Whisper Base | WhisperKit (CoreML) | 74M | ~150 MB (CoreML) | English | 19.6 | ❌ OOM on 4 GB |
| SenseVoice Small | sherpa-onnx offline | 234M | ~240 MB (INT8) | zh/en/ja/ko/yue | 15.6 | ✅ |
| Whisper Base | whisper.cpp | 74M | ~57 MB (GGML Q5_1) | 99 languages | 13.8 | ✅ |
| Whisper Small | WhisperKit (CoreML) | 244M | ~500 MB (CoreML) | 99 languages | 6.3 | ❌ OOM on 4 GB |
| Qwen3 ASR 0.6B | Pure C (ARM NEON) | 600M | ~1.8 GB | 30 languages | 5.6 | ✅ |
| Qwen3 ASR 0.6B (ONNX) | ONNX Runtime (INT8) | 600M | ~1.6 GB (INT8) | 30 languages | 5.4 | ✅ |
| Whisper Tiny | WhisperKit (CoreML) | 39M | ~80 MB (CoreML) | 99 languages | 4.5 | ✅ |
| Whisper Small | whisper.cpp | 244M | ~181 MB (GGML Q5_1) | 99 languages | 3.9 | ✅ |
| Whisper Large v3 Turbo (compressed) | WhisperKit (CoreML) | 809M | ~1 GB (CoreML) | 99 languages | 1.9 | ❌ OOM on 4 GB |
| Whisper Large v3 Turbo | WhisperKit (CoreML) | 809M | ~600 MB (CoreML) | 99 languages | 1.4 | ❌ OOM on 4 GB |
| Whisper Large v3 Turbo | whisper.cpp | 809M | ~547 MB (GGML Q5_0) | 99 languages | 0.8 | ⚠️ RTF >1 |
| Whisper Large v3 Turbo (compressed) | whisper.cpp | 809M | ~834 MB (GGML Q8_0) | 99 languages | 0.8 | ⚠️ RTF >1 |

WhisperKit OOM warning: On 4 GB devices, WhisperKit CoreML crashes (OOM) for Whisper Base and above. The tok/s values shown for OOM models were measured before the crash occurred and do not represent complete successful runs. whisper.cpp handles the same models without OOM, though at lower throughput.

macOS Results

Device: MacBook Air M4, 32 GB RAM

macOS Inference Speed — Tokens per Second

| Model | Engine | Params | Size | Languages | tok/s | Status |
|---|---|---|---|---|---|---|
| Parakeet TDT v3 | FluidAudio (CoreML) | 600M | ~600 MB (CoreML) | 25 European | 171.6 | ✅ |
| Moonshine Tiny | sherpa-onnx offline | 27M | ~125 MB (INT8) | English | 92.2 | ✅ |
| Zipformer 20M | sherpa-onnx streaming | 20M | ~46 MB (INT8) | English | 77.4 | ✅ |
| Moonshine Base | sherpa-onnx offline | 61M | ~280 MB (INT8) | English | 59.3 | ✅ |
| SenseVoice Small | sherpa-onnx offline | 234M | ~240 MB (INT8) | zh/en/ja/ko/yue | 27.4 | ✅ |
| Whisper Tiny | WhisperKit (CoreML) | 39M | ~80 MB (CoreML) | 99 languages | 24.7 | ✅ |
| Whisper Base | WhisperKit (CoreML) | 74M | ~150 MB (CoreML) | English | 23.3 | ✅ |
| Apple Speech | SFSpeechRecognizer | System | Built-in | 50+ languages | 13.1 | ✅ |
| Whisper Small | WhisperKit (CoreML) | 244M | ~500 MB (CoreML) | 99 languages | 8.7 | ✅ |
| Qwen3 ASR 0.6B (ONNX) | ONNX Runtime (INT8) | 600M | ~1.6 GB (INT8) | 30 languages | 8.0 | ✅ |
| Qwen3 ASR 0.6B | Pure C (ARM NEON) | 600M | ~1.8 GB | 30 languages | 5.7 | ✅ |
| Whisper Large v3 Turbo | WhisperKit (CoreML) | 809M | ~600 MB (CoreML) | 99 languages | 1.9 | ✅ |
| Whisper Large v3 Turbo (compressed) | WhisperKit (CoreML) | 809M | ~1 GB (CoreML) | 99 languages | 1.5 | ✅ |
| Qwen3 ASR 0.6B (MLX) | MLX (Metal GPU) | 600M | ~400 MB (4-bit) | 30 languages | Not benchmarked | — |
| Omnilingual 300M | sherpa-onnx offline | 300M | ~365 MB (INT8) | 1,600+ languages | 0.03 | ❌ English broken |

macOS has no OOM issues — with 32 GB RAM, all models including Whisper Large v3 Turbo run successfully via WhisperKit CoreML.

Windows Results

Device: Intel Core i5-1035G1 @ 1.00 GHz (4C/8T), 8 GB RAM, CPU-only

Windows Inference Speed — Words per Second

| Model | Engine | Params | Size | Inference | Words/s | RTF |
|---|---|---|---|---|---|---|
| Moonshine Tiny | sherpa-onnx offline | 27M | ~125 MB | 435 ms | 50.6 | 0.040 |
| SenseVoice Small | sherpa-onnx offline | 234M | ~240 MB | 462 ms | 47.6 | 0.042 |
| Moonshine Base | sherpa-onnx offline | 61M | ~290 MB | 534 ms | 41.2 | 0.049 |
| Parakeet TDT v2 | sherpa-onnx offline | 600M | ~660 MB | 1,239 ms | 17.8 | 0.113 |
| Zipformer 20M | sherpa-onnx streaming | 20M | ~73 MB | 1,775 ms | 12.4 | 0.161 |
| Whisper Tiny | whisper.cpp | 39M | ~80 MB | 2,325 ms | 9.5 | 0.211 |
| Omnilingual 300M | sherpa-onnx offline | 300M | ~365 MB | 2,360 ms | — | 0.215 |
| Whisper Base | whisper.cpp | 74M | ~150 MB | 6,501 ms | 3.4 | 0.591 |
| Qwen3 ASR 0.6B | qwen-asr (C) | 600M | ~1.9 GB | 13,359 ms | 1.6 | 1.214 |
| Whisper Small | whisper.cpp | 244M | ~500 MB | 21,260 ms | 1.0 | 1.933 |
| Whisper Large v3 Turbo | whisper.cpp | 809M | ~834 MB | 92,845 ms | 0.2 | 8.440 |
| Windows Speech | Windows Speech API | N/A | 0 MB | — | — | — |

Tested with an 11-second JFK inauguration audio excerpt (22 words). All models run CPU-only on x86_64 (i5-1035G1, 4C/8T) — no GPU acceleration. Parakeet TDT v2 is notable for combining fast inference (17.8 words/s) with full punctuation output.
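Because the Windows clip is shorter than the 30-second clip used elsewhere, its Words/s and RTF columns derive from the 11 s / 22-word sample. For example, the Moonshine Tiny row reproduces as:

```python
# Windows benchmark clip: 11 s of audio, 22 transcribed words (per the notes above)
words, audio_s = 22, 11.0
inference_ms = 435  # Moonshine Tiny inference time from the table

words_per_s = words / (inference_ms / 1000.0)
rtf = (inference_ms / 1000.0) / audio_s
print(f"{words_per_s:.1f} words/s, RTF {rtf:.3f}")  # 50.6 words/s, RTF 0.040
```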

Limitations

  • Speed only: This benchmark measures inference speed, not transcription accuracy (WER/CER). Accuracy varies by model, language, and audio conditions — developers should evaluate accuracy separately for their target use case.
  • Single audio sample: All platforms use a single JFK inaugural address recording. Results may differ with other audio characteristics (noise, accents, domain-specific vocabulary).
  • Windows clip length: Windows uses an 11-second audio clip vs 30 seconds on other platforms, so cross-platform speed comparisons should account for this difference.
  • iOS OOM values: The tok/s values shown for OOM-marked iOS models were measured before the crash and do not represent complete successful runs.

Further Research

  • Add accuracy benchmarks: Pair speed metrics with WER/CER across multilingual datasets and noisy speech conditions.
  • Expand audio conditions: Add long-form audio, overlapping speakers, and domain vocabulary (meetings, call-center, industrial) instead of one speech sample.
  • Quantization sweep: Benchmark INT8/INT4 and mixed-precision variants across engines to map memory/speed/accuracy trade-offs.
  • Additional Qwen3-ASR variants: Benchmark other sub-1B checkpoints beyond the 0.6B model tested here (for example, a potential 0.8B release) to map the speed/quality trade-off across model sizes.
  • Power and thermal profiling: Add battery drain and sustained-performance measurements for continuous on-device transcription workloads.

Conclusion

On-device speech transcription achieves faster-than-real-time speeds across all major mobile and desktop platforms. The choice of inference engine matters as much as model selection — sherpa-onnx on Android and CoreML on Apple deliver 10–50x speedups over naive CPU inference. Among the fastest models, Moonshine Tiny (English) or Whisper Tiny (multilingual) with the right engine achieves real-time transcription at minimal memory cost. Model selection should also consider transcription accuracy (WER), which was not measured in this benchmark.

References

Our Repositories:

Models:

Inference Engines:
