
Offline Speech Translation App: Cross-Platform On-Device Transcription, Translation, and TTS for iOS and Android

Akinori Nakajima - VoicePing

Open-source cross-platform mobile app for fully offline speech translation — combining on-device ASR (SenseVoice), neural machine translation, and TTS on iOS and Android with system audio capture

Source Code:

Abstract

We present an open-source cross-platform mobile application that performs fully offline speech translation on iOS and Android. The app combines on-device automatic speech recognition (SenseVoice Small via sherpa-onnx), neural machine translation (Apple Translation / Google ML Kit), and text-to-speech synthesis into a unified pipeline. Both platforms support microphone and system audio capture — enabling translation of audio from other apps (video calls, media, etc.) without cloud connectivity, subject to platform capture restrictions. On-device ASR achieves 23.6 tok/s on iPad Pro (A12X) and 33.6 tok/s on Samsung Galaxy S10 (RTF < 0.1). Translation and TTS stages use platform-native engines and were not individually benchmarked in this release.

Motivation

Can a complete speech translation pipeline — ASR, machine translation, and TTS — run reliably offline on consumer phones from 2018–2019? Most on-device AI research benchmarks individual models in isolation, but a real application must chain multiple stages together, manage system audio capture across platform sandboxing restrictions, and handle the full lifecycle (persistence, export, background processing).

This project tests the system feasibility question: whether the full pipeline works end-to-end on real consumer hardware without network connectivity. The app is open-source so developers can evaluate the architecture directly.

App Overview

iOS

Screenshots: Home, Transcription + Translation, and Demo.

SenseVoice Small with Apple Translation (English → Japanese) and TTS.

Android

Screenshots: Transcription + Translation and Demo.

SenseVoice Small with ML Kit Translation and TTS.

Pipeline Architecture

The app implements a complete offline speech translation pipeline:

| Stage | Component | Details |
|---|---|---|
| Audio Input | Mic / System Audio | Microphone or system audio capture |
| → ASR | SenseVoice Small | Speech-to-text via sherpa-onnx (offline) |
| → Translation | Apple Translation / Google ML Kit | Neural machine translation (offline) |
| → TTS | System TTS | AVSpeechSynthesizer (iOS) / Android TextToSpeech |
| Audio Output | Speaker | Translated speech playback |

Each stage runs entirely on-device with no network dependency at inference time.
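
The stage chaining above can be sketched as a simple composition of callables. The names below are illustrative stand-ins, not the app's actual APIs — on device, ASR is sherpa-onnx, translation is Apple Translation / ML Kit, and TTS is the system engine:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class TranslationPipeline:
    """Illustrative three-stage pipeline: audio -> text -> translation -> audio."""
    asr: Callable[[bytes], str]        # audio -> source-language text
    translate: Callable[[str], str]    # source text -> target-language text
    tts: Callable[[str], bytes]        # target text -> synthesized audio

    def run(self, audio: bytes) -> tuple[str, str, bytes]:
        text = self.asr(audio)
        translated = self.translate(text)
        speech = self.tts(translated)
        return text, translated, speech

# Stub stages, just to show the data flow end-to-end.
pipeline = TranslationPipeline(
    asr=lambda audio: "hello world",
    translate=lambda text: "こんにちは世界",
    tts=lambda text: text.encode("utf-8"),
)
src, dst, audio_out = pipeline.run(b"\x00" * 16000)
print(src, dst)  # hello world こんにちは世界
```

Because each stage is a plain function of the previous stage's output, swapping SenseVoice for Apple Speech, or ML Kit for Apple Translation, changes only one callable.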

Supported Models

iOS:

| Model | Engine | Languages |
|---|---|---|
| SenseVoice Small | sherpa-onnx offline | zh/en/ja/ko/yue |
| Apple Speech | SFSpeechRecognizer | 50+ languages |

Android:

| Model | Engine | Languages |
|---|---|---|
| SenseVoice Small | sherpa-onnx offline | zh/en/ja/ko/yue |
| Android Speech (Offline) | SpeechRecognizer (on-device, API 31+) | System languages |
| Android Speech (Online) | SpeechRecognizer (standard) | System languages |

Translation Providers

| Platform | Provider | Mode | Coverage |
|---|---|---|---|
| iOS | Apple Translation | Offline (iOS 18+) | 20+ language pairs |
| Android | Google ML Kit | Offline | 59 languages |
| Android | Android System Translation | Offline (API 31+) | System languages |

TTS

| Platform | Engine |
|---|---|
| iOS | AVSpeechSynthesizer |
| Android | Android TextToSpeech |

Audio Capture Scope

This app supports both microphone input and system audio capture from other apps (subject to platform restrictions such as DRM and app-level opt-out). To keep this article focused on pipeline behavior and deployment outcomes, low-level capture implementation details are intentionally omitted here. For iOS capture internals, see the offline transcription project: ios-mac-offline-transcribe.

Data Persistence and Export

Both platforms store transcription history locally and support export:

| Feature | iOS | Android |
|---|---|---|
| Persistence | SwiftData (TranscriptionRecord) | Room (TranscriptionEntity, AppDatabase) |
| Audio files | SessionFileManager | AudioPlaybackManager |
| Export | ZIP export (ZIPExporter) | ZIP export (SessionExporter) |
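
On both platforms, export amounts to bundling a session's artifacts into a ZIP archive. A minimal stdlib sketch of that idea — the file layout here is hypothetical, not the actual format produced by ZIPExporter or SessionExporter:

```python
import io
import zipfile

def export_session(transcript: str, translation: str, audio: bytes) -> bytes:
    """Bundle one session's artifacts into an in-memory ZIP archive.
    File names are illustrative, not the app's real export layout."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("transcript.txt", transcript)
        zf.writestr("translation.txt", translation)
        zf.writestr("session.wav", audio)
    return buf.getvalue()

data = export_session("hello world", "こんにちは世界", b"\x00" * 1024)
with zipfile.ZipFile(io.BytesIO(data)) as zf:
    print(zf.namelist())  # ['transcript.txt', 'translation.txt', 'session.wav']
```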

Limitations

  • ASR only benchmarked: Only the ASR stage (SenseVoice Small) was benchmarked for speed. Translation and TTS stages use platform-native engines and were not individually measured — end-to-end pipeline latency is unknown.
  • System audio capture restrictions: Some apps opt out of audio capture, so “other apps” capture is not universal.
  • Two devices tested: Results are from Galaxy S10 (2019) and iPad Pro 3rd gen (2018). Performance on other devices may vary.
  • No accuracy evaluation: ASR transcription accuracy (WER) and translation quality were not formally measured in this release.

Further Research

  • End-to-end latency breakdown: Measure ASR, translation, and TTS stages separately and report full pipeline latency percentiles.
  • Quality evaluation: Add WER for ASR and translation quality metrics with human validation for common language pairs.
  • Broader device matrix: Benchmark mid-range and newer NPU-equipped phones to understand scaling across 2018–2026 hardware.
  • Background reliability: Stress-test long sessions, interruptions, and background execution policies on both OS platforms.
  • Power and thermals: Quantify battery drain and thermal throttling during continuous translation sessions.

Conclusion

Fully offline speech translation is practical on current mobile hardware. The ASR stage (SenseVoice Small) achieves 23–34 tok/s with RTF < 0.1 on consumer devices from 2019 (Galaxy S10) and 2018 (iPad Pro 3rd gen). Translation and TTS use platform-native engines (Apple Translation / Google ML Kit and system TTS) which were not individually benchmarked — end-to-end pipeline latency depends on utterance length and these downstream stages.
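
For readers unfamiliar with the metric: RTF (real-time factor) is processing time divided by audio duration, so RTF < 0.1 means decoding at least 10× faster than real time. A quick sketch with illustrative numbers — the decode time and token count below are made up for the example, not measured values from the benchmark:

```python
def rtf(decode_seconds: float, audio_seconds: float) -> float:
    """Real-time factor: processing time / audio duration."""
    return decode_seconds / audio_seconds

def tokens_per_second(num_tokens: int, decode_seconds: float) -> float:
    """Decoding throughput in tokens per second."""
    return num_tokens / decode_seconds

# Illustrative (not measured): a 10 s utterance decoded in 0.9 s,
# producing 30 output tokens.
print(round(rtf(0.9, 10.0), 3))              # 0.09  -> RTF < 0.1
print(round(tokens_per_second(30, 0.9), 1))  # 33.3 tok/s
```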

The system audio capture capability extends translation beyond microphone input to other audio sources, enabling translation of video calls, media, and other apps without cloud connectivity.

For edge deployment scenarios where network access is unreliable or where data privacy is paramount, this architecture demonstrates that the full speech translation pipeline can be deployed entirely on-device using consumer hardware. The app is open-source under Apache 2.0 and supports community contributions of additional models and benchmark results.

References

Our Repository:

ASR Models:

Inference Engine:

  • sherpa-onnx — Next-gen Kaldi ONNX Runtime for on-device speech processing

Translation:
