
Offline Speech Translation App: Cross-Platform On-Device Transcription, Translation, and TTS for iOS and Android

Akinori Nakajima - VoicePing

Open-source cross-platform mobile app for fully offline speech translation — combining on-device ASR (SenseVoice), neural machine translation, and TTS on iOS and Android with system audio capture

Source Code:

Abstract

We present an open-source cross-platform mobile application that performs fully offline speech translation on iOS and Android. The app combines on-device automatic speech recognition (SenseVoice Small via sherpa-onnx), neural machine translation (Apple Translation / Google ML Kit), and text-to-speech synthesis into a unified pipeline. Both platforms support microphone and system audio capture — enabling translation of audio from other apps (video calls, media, etc.) without cloud connectivity, subject to platform capture restrictions. On-device ASR achieves 23.6 tok/s on iPad Pro (A12X) and 33.6 tok/s on Samsung Galaxy S10 (RTF < 0.1). Translation and TTS stages use platform-native engines and were not individually benchmarked in this release.

Motivation

Can a complete speech translation pipeline — ASR, machine translation, and TTS — run reliably offline on consumer phones from 2018–2019? Most on-device AI research benchmarks individual models in isolation, but a real application must chain multiple stages together, manage system audio capture across platform sandboxing restrictions, and handle the full lifecycle (persistence, export, background processing).

This project tests the system feasibility question: whether the full pipeline works end-to-end on real consumer hardware without network connectivity. The app is open-source so developers can evaluate the architecture directly.

App Overview

iOS

Screenshots: Home, Transcription + Translation, and Demo.

SenseVoice Small with Apple Translation (English → Japanese) and TTS.

Android

Screenshots: Transcription + Translation and Demo.

SenseVoice Small with ML Kit Translation and TTS.

Pipeline Architecture

The app implements a complete offline speech translation pipeline:

| Stage | Component | Details |
|---|---|---|
| Audio Input | Mic / System Audio | Microphone or system audio capture |
| → ASR | SenseVoice Small | Speech-to-text via sherpa-onnx (offline) |
| → Translation | Apple Translation / Google ML Kit | Neural machine translation (offline) |
| → TTS | System TTS | AVSpeechSynthesizer (iOS) / Android TextToSpeech |
| Audio Output | Speaker | Translated speech playback |

Each stage runs entirely on-device with no network dependency at inference time.
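
The stage chaining above can be sketched as a simple composition of callables. The names below are illustrative stand-ins, not the app's actual APIs — on device, ASR is sherpa-onnx, translation is Apple Translation / ML Kit, and TTS is the system engine:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class TranslationPipeline:
    """Illustrative three-stage pipeline: audio -> text -> translation -> audio."""
    asr: Callable[[bytes], str]        # audio -> source-language text
    translate: Callable[[str], str]    # source text -> target-language text
    tts: Callable[[str], bytes]        # target text -> synthesized audio

    def run(self, audio: bytes) -> tuple[str, str, bytes]:
        text = self.asr(audio)
        translated = self.translate(text)
        speech = self.tts(translated)
        return text, translated, speech

# Stub stages, just to show the data flow end-to-end.
pipeline = TranslationPipeline(
    asr=lambda audio: "hello world",
    translate=lambda text: "こんにちは世界",
    tts=lambda text: text.encode("utf-8"),
)
src, dst, audio_out = pipeline.run(b"\x00" * 16000)
print(src, dst)  # hello world こんにちは世界
```

Because each stage is a plain function of the previous stage's output, swapping SenseVoice for Apple Speech, or ML Kit for Apple Translation, changes only one callable.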

Supported Models

iOS:

| Model | Engine | Languages |
|---|---|---|
| SenseVoice Small | sherpa-onnx offline | zh/en/ja/ko/yue |
| Apple Speech | SFSpeechRecognizer | 50+ languages |

Android:

| Model | Engine | Languages |
|---|---|---|
| SenseVoice Small | sherpa-onnx offline | zh/en/ja/ko/yue |
| Android Speech (Offline) | SpeechRecognizer (on-device, API 31+) | System languages |
| Android Speech (Online) | SpeechRecognizer (standard) | System languages |

Translation Providers

| Platform | Provider | Mode | Coverage |
|---|---|---|---|
| iOS | Apple Translation | Offline (iOS 18+) | 20+ language pairs |
| Android | Google ML Kit | Offline | 59 languages |
| Android | Android System Translation | Offline (API 31+) | System languages |

TTS

| Platform | Engine |
|---|---|
| iOS | AVSpeechSynthesizer |
| Android | Android TextToSpeech |

Audio Capture Scope

This app supports both microphone input and system audio capture from other apps (subject to platform restrictions such as DRM and app-level opt-out). To keep this article focused on pipeline behavior and deployment outcomes, low-level capture implementation details are intentionally omitted here. For iOS capture internals, see the offline transcription project: ios-mac-offline-transcribe.

Data Persistence and Export

Both platforms store transcription history locally and support export:

| Feature | iOS | Android |
|---|---|---|
| Persistence | SwiftData (TranscriptionRecord) | Room (TranscriptionEntity, AppDatabase) |
| Audio files | SessionFileManager | AudioPlaybackManager |
| Export | ZIP export (ZIPExporter) | ZIP export (SessionExporter) |
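
On both platforms, export amounts to bundling a session's artifacts into a ZIP archive. A minimal stdlib sketch of that idea — the file layout here is hypothetical, not the actual format produced by ZIPExporter or SessionExporter:

```python
import io
import zipfile

def export_session(transcript: str, translation: str, audio: bytes) -> bytes:
    """Bundle one session's artifacts into an in-memory ZIP archive.
    File names are illustrative, not the app's real export layout."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("transcript.txt", transcript)
        zf.writestr("translation.txt", translation)
        zf.writestr("session.wav", audio)
    return buf.getvalue()

data = export_session("hello world", "こんにちは世界", b"\x00" * 1024)
with zipfile.ZipFile(io.BytesIO(data)) as zf:
    print(zf.namelist())  # ['transcript.txt', 'translation.txt', 'session.wav']
```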

Limitations

  • ASR only benchmarked: Only the ASR stage (SenseVoice Small) was benchmarked for speed. Translation and TTS stages use platform-native engines and were not individually measured — end-to-end pipeline latency is unknown.
  • System audio capture restrictions: Some apps opt out of audio capture, so “other apps” capture is not universal.
  • Two devices tested: Results are from Galaxy S10 (2019) and iPad Pro 3rd gen (2018). Performance on other devices may vary.
  • No accuracy evaluation: ASR transcription accuracy (WER) and translation quality were not formally measured in this release.

Further Research

  • End-to-end latency breakdown: Measure ASR, translation, and TTS stages separately and report full pipeline latency percentiles.
  • Quality evaluation: Add WER for ASR and translation quality metrics with human validation for common language pairs.
  • Broader device matrix: Benchmark mid-range and newer NPU-equipped phones to understand scaling across 2018–2026 hardware.
  • Background reliability: Stress-test long sessions, interruptions, and background execution policies on both OS platforms.
  • Power and thermals: Quantify battery drain and thermal throttling during continuous translation sessions.

Conclusion

Fully offline speech translation is practical on current mobile hardware. The ASR stage (SenseVoice Small) achieves 23–34 tok/s with RTF < 0.1 on consumer devices from 2019 (Galaxy S10) and 2018 (iPad Pro 3rd gen). Translation and TTS use platform-native engines (Apple Translation / Google ML Kit and system TTS) which were not individually benchmarked — end-to-end pipeline latency depends on utterance length and these downstream stages.
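
For readers unfamiliar with the metric: RTF (real-time factor) is processing time divided by audio duration, so RTF < 0.1 means decoding at least 10× faster than real time. A quick sketch with illustrative numbers — the decode time and token count below are made up for the example, not measured values from the benchmark:

```python
def rtf(decode_seconds: float, audio_seconds: float) -> float:
    """Real-time factor: processing time / audio duration."""
    return decode_seconds / audio_seconds

def tokens_per_second(num_tokens: int, decode_seconds: float) -> float:
    """Decoding throughput in tokens per second."""
    return num_tokens / decode_seconds

# Illustrative (not measured): a 10 s utterance decoded in 0.9 s,
# producing 30 output tokens.
print(round(rtf(0.9, 10.0), 3))              # 0.09  -> RTF < 0.1
print(round(tokens_per_second(30, 0.9), 1))  # 33.3 tok/s
```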

The system audio capture capability extends translation beyond microphone input to other audio sources, enabling translation of video calls, media, and other apps without cloud connectivity.

For edge deployment scenarios where network access is unreliable or where data privacy is paramount, this architecture demonstrates that the full speech translation pipeline can be deployed entirely on-device using consumer hardware. The app is open-source under Apache 2.0 and supports community contributions of additional models and benchmark results.

References

Our Repository:

ASR Models:

Inference Engine:

  • sherpa-onnx — Next-gen Kaldi ONNX Runtime for on-device speech processing

Translation:
