
Benchmarking bidirectional English-Japanese speech translation models — Qwen3-ASR (1.7B, highest quality) vs distilled Whisper (756M, 4x faster) — against OpenAI Whisper large-v3 and Meta SeamlessM4T v2
Models (Hugging Face):
- voiceping-ai/qwen3-asr-ja-en-speech-translation — Qwen3-ASR fine-tuned for bidirectional EN↔JA speech translation (1.7B params, highest quality in its eval)
- voiceping-ai/whisper-ja-en-speech-translation — Distilled Whisper for bidirectional EN↔JA speech translation (756M params, 4x faster)
Inference Code (GitHub):
- qwen3-asr-ja-en-speech-translation — Inference scripts, evaluation pipeline, and model usage examples
- whisper-ja-en-speech-translation — Inference scripts, evaluation pipeline, and model usage examples
Training scripts are not published in these repositories.
Abstract
We present two bidirectional English-Japanese speech translation models: (1) Qwen3-ASR EN-JA (1.7B parameters), fine-tuned from Qwen3-ASR-1.7B via full-parameter SFT on ~1.27M translation pairs, scoring 4.2/5 EN→JA and 4.0/5 JA→EN in its evaluation; and (2) Whisper EN-JA (756M parameters), a distilled Whisper large-v2 with a 4-layer decoder, achieving 212 tok/s — 4.6x faster than Qwen3-ASR. Both models are evaluated on the FLEURS test set alongside OpenAI Whisper large-v3 and Meta SeamlessM4T v2 Large. Quality was scored by an LLM judge (Claude Opus 4.6). Each repo runs its own evaluation independently; scores are not directly comparable across tables.
Motivation
For EN↔JA speech translation specifically, developers face a decision between two trade-off profiles: high-quality translation (important for business communication and safety-critical contexts) vs high-speed translation (required for real-time interactive use on edge devices). Existing models either lack bidirectional EN↔JA support (Whisper large-v3 cannot produce Japanese output) or are too slow for on-device deployment.
This research quantifies the speed-quality trade-off for this specific language pair by fine-tuning two architectures — one optimized for quality (Qwen3-ASR, 1.7B), one for speed (distilled Whisper, 756M) — and benchmarking both against established baselines. The results enable developers to make an informed choice based on their deployment constraints.
Evaluation Methodology
Both models are evaluated on FLEURS test samples for both translation directions. Quality scored on a 1–5 scale (accuracy + fluency) by an LLM judge (Claude Opus 4.6). Speed measured on NVIDIA RTX PRO 6000 with bfloat16. Each repo runs its own evaluation independently — scores are not directly comparable across tables. Full methodology details in the Appendix.
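For reference, the judging step can be approximated with a call like the sketch below. This is not the repos' actual evaluation script: the Anthropic model id string, the prompt wording, and whether a FLEURS reference translation is passed alongside the source are all assumptions.

```python
# Illustrative LLM-judge call; the prompt wording and model id are placeholders,
# not the exact setup used in the repositories.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def judge_translation(source: str, reference: str, hypothesis: str, direction: str) -> int:
    prompt = (
        f"Rate the following {direction} speech translation on a 1-5 scale, "
        "considering both accuracy and fluency. Reply with a single integer.\n\n"
        f"Source transcript:\n{source}\n\n"
        f"Reference translation:\n{reference}\n\n"
        f"Model translation:\n{hypothesis}"
    )
    msg = client.messages.create(
        model="claude-opus-4-6",  # placeholder id standing in for "Claude Opus 4.6"
        max_tokens=8,
        messages=[{"role": "user", "content": prompt}],
    )
    return int(msg.content[0].text.strip())
```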
Results
Qwen3-ASR Evaluation (4-Model Comparison)
| Model | Parameters | EN→JA | JA→EN | Speed (tok/s) |
|---|---|---|---|---|
| OpenAI Whisper large-v3 | 1.55B | N/A | 3.2/5 | 51.0 |
| Meta SeamlessM4T v2 Large | 1.50B | 3.8/5 | 3.0/5 | 48.6 |
| Whisper EN-JA Translation (ours) | 756M | 2.6/5 | 2.4/5 | 212.1 |
| Qwen3-ASR EN-JA Translation (ours) | 1.7B | 4.2/5 | 4.0/5 | 45.8 |
Quality scored on FLEURS test samples (1–5 scale: accuracy + fluency). Speed benchmarked on NVIDIA GPU with bfloat16. All scores from the Qwen3-ASR repo evaluation.
Whisper EN-JA Evaluation (3-Model Comparison)
| Model | Parameters | EN→JA | JA→EN | Speed (tok/s) |
|---|---|---|---|---|
| OpenAI Whisper large-v3 | 1.55B | N/A | 3.6/5 | 51.0 |
| Meta SeamlessM4T v2 Large | 1.50B | 3.8/5 | 4.4/5 | 48.6 |
| Whisper EN-JA Translation (ours) | 756M | 3.4/5 | 3.4/5 | 212.1 |
All scores from the Whisper EN-JA repo evaluation. Baseline scores differ from the table above due to different evaluation samples and scoring methodology.
Speed Comparison
| Model | Parameters | Speed (tok/s) | Relative Speed |
|---|---|---|---|
| Qwen3-ASR EN-JA (ours) | 1.7B | 45.8 | 1.0x |
| Meta SeamlessM4T v2 Large | 1.50B | 48.6 | 1.1x |
| OpenAI Whisper large-v3 | 1.55B | 51.0 | 1.1x |
| Whisper EN-JA (ours) | 756M | 212.1 | 4.6x |
The distilled Whisper model achieves 4.6x the throughput of Qwen3-ASR with less than half the parameters (756M vs 1.7B), making it suitable for latency-sensitive deployments.
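Throughput figures of this kind can be reproduced with a simple wall-clock measurement around generate(). The sketch below is not the repos' benchmark script; it assumes the distilled model loads with the standard Whisper classes and uses a silent dummy clip rather than FLEURS audio.

```python
# Minimal tok/s measurement sketch (a real benchmark would average over FLEURS clips).
import time
import numpy as np
import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor

model_id = "voiceping-ai/whisper-ja-en-speech-translation"
processor = WhisperProcessor.from_pretrained(model_id)
model = WhisperForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16
).to("cuda")

audio = np.zeros(16000, dtype=np.float32)  # 1 s of silence as a stand-in utterance
features = processor(audio, sampling_rate=16000, return_tensors="pt").input_features
features = features.to("cuda", torch.bfloat16)

torch.cuda.synchronize()
start = time.perf_counter()
with torch.no_grad():
    out = model.generate(features, max_new_tokens=256)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start
print(f"{out.shape[-1] / elapsed:.1f} tok/s (decoded tokens / generation wall-clock)")
```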
Model Architectures
- Qwen3-ASR EN-JA (1.7B) — Quality-optimized. Full-parameter SFT from Qwen3-ASR-1.7B on ~1.27M EN↔JA translation pairs. Audio encoder + language model architecture.
- Whisper EN-JA (756M) — Speed-optimized. Distilled from Whisper large-v2: full 32-layer encoder (frozen) + 4-layer decoder (reduced from 32), enabling 4.6x faster inference at half the parameter count.
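As a rough illustration of the distillation setup (not the actual training code, which is not published): a 4-layer student can be built from Whisper large-v2 by copying the full encoder and initializing the student decoder from a subset of teacher decoder layers. The layer indices chosen below are an assumption.

```python
# Sketch: build a 4-decoder-layer Whisper student from large-v2.
# Layer selection and initialization details are assumptions, not the published recipe.
import copy
from transformers import WhisperForConditionalGeneration

teacher = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v2")

cfg = copy.deepcopy(teacher.config)
cfg.decoder_layers = 4                      # reduce the decoder from 32 to 4 layers
student = WhisperForConditionalGeneration(cfg)

# Keep the full 32-layer encoder and freeze it during training.
student.model.encoder.load_state_dict(teacher.model.encoder.state_dict())
for p in student.model.encoder.parameters():
    p.requires_grad = False

# Initialize the 4 student decoder layers from (roughly) evenly spaced teacher layers.
for s_idx, t_idx in enumerate([0, 10, 21, 31]):
    student.model.decoder.layers[s_idx].load_state_dict(
        teacher.model.decoder.layers[t_idx].state_dict()
    )
student.model.decoder.embed_tokens.load_state_dict(teacher.model.decoder.embed_tokens.state_dict())
student.model.decoder.embed_positions.load_state_dict(teacher.model.decoder.embed_positions.state_dict())
student.model.decoder.layer_norm.load_state_dict(teacher.model.decoder.layer_norm.state_dict())
```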
Full training hyperparameters, VRAM usage, and inference details are in the Appendix below.
Translation Examples
Side-by-side comparison on FLEURS test set samples:
EN → JA
| Source (EN audio) | SeamlessM4T v2 | Whisper EN-JA (ours) | Qwen3-ASR EN-JA (ours) |
|---|---|---|---|
| through the night between 150 and 200 copies were made now known as dunlap broadsides | 今ではダンラップ・ブロードサイズとして知られる 150 枚から 200 枚のコピーが作られました | 百五十から二百までのコピーが生成されました。 | 150から200本のコピーが作られました。これらは今では「Dunlap Broadside」として知られています。 |
| the find also grants insight into the evolution of feathers in birds | この発見は羽の進化と鳥の進化についても洞察を深めます | この発見は、鳥の羽の形にも影響を与えます。 | この発見は、羽や鳥の進化についても洞察を与えてくれます。 |
| many people don’t think about them as dinosaurs because they have feathers and can fly | 羽があって飛べるから 恐とは思わない人も多い | 多くの人々は、恐竜とは思わない。なぜなら、恐竜には羽があり、飛ぶことができるからです。 | (not in Qwen3 eval set) |
JA → EN
| Source (JA audio) | Whisper large-v3 | Whisper EN-JA (ours) | Qwen3-ASR EN-JA (ours) |
|---|---|---|---|
| 残念ながら 運転手の行動を100%の確率で予測することはできないので 交通の流れを研究することは困難です | Unfortunately, it is difficult to predict the behavior of the driver with 100% probability, so it is difficult to study the flow of traffic. | Unfortunately, it’s impossible to predict the behavior of drivers with 100% certainty. Therefore, it’s difficult to study the flow of traffic. | Unfortunately, it’s not possible to predict the actions of drivers with 100% certainty. Therefore, it’s difficult to study the flow of traffic. |
| 夜空の下でピラミッドに映像が浮かび ピラミッドが次々とライトアップされます | The image of the pyramid is reflected under the night sky, and the pyramid is lighted up one after another. | The pyramid is lit up one after another. | Under the night sky, images of pyramids are illuminated. The pyramids are lit up one after another. |
| キルヒネル夫人は アルゼンチン劇場で大統領選に出馬するという意向を表明しました | Mrs. Kirchner expressed her intention to appear in the presidential election at the Argentine theater. | Kylhine announced her intention to run for president in the Argentina theater. | Mrs. Kirchner announced her intention to run for president at the Alzen Theatre. |
In the examples above, Qwen3-ASR produces more complete and contextually accurate translations, particularly for EN→JA where it preserves proper nouns and sentence structure. Whisper EN-JA tends to omit details but maintains fluency. Note: one EN→JA sample was not in the Qwen3 evaluation set, so the comparison is partial.
Limitations
- LLM-judged quality: Scores were generated by Claude Opus 4.6, not human raters. LLM judges may have systematic biases that differ from human translation quality assessment.
- Separate evaluation runs: The two repos use different FLEURS subsets and different scoring runs, so baseline scores (e.g., SeamlessM4T) differ between tables and are not directly comparable across them.
- Single language pair: Results apply to EN↔JA only and may not generalize to other language pairs.
- GPU-only speed: Speed was measured on NVIDIA GPU with bfloat16. On-device (mobile CPU/NPU) performance will differ significantly.
Further Research
- Smaller quality-oriented models: Evaluate smaller Qwen3-ASR variants (for example, Qwen3-ASR-0.6B) and future sub-1B variants (such as a potential 0.8B release) to quantify quality/speed/memory trade-offs against 1.7B.
- One shared eval split: Re-run all models on one fixed FLEURS subset with one scoring pipeline so quality numbers become directly comparable.
- Human evaluation protocol: Add bilingual human raters and inter-rater agreement reporting to validate LLM-judge results.
- Simultaneous translation latency: Measure translation lag (not only tok/s) using streaming/simultaneous metrics for real-time usage.
- On-device deployment benchmark: Reproduce the same comparison on Android/iOS NPUs and CPUs, including memory and energy usage.
Conclusion
Speech translation for EN↔JA is achievable with two distinct trade-off profiles:
Speed-first: Whisper EN-JA (756M) is the practical choice for real-time applications — 212 tok/s with half the parameters of comparable models, achieving 4.6x higher throughput than Qwen3-ASR.
Accuracy-first: Qwen3-ASR EN-JA (1.7B) is the better choice when translation quality is the priority — scoring 4.2/5 EN→JA and 4.0/5 JA→EN in its own evaluation run.
Neither baseline covers both directions equally well: Whisper large-v3 cannot produce Japanese output, so it cannot translate EN→JA at all, and SeamlessM4T v2 trails Qwen3-ASR EN-JA in both directions in the Qwen3-ASR evaluation. Both of our models handle bidirectional translation from a single checkpoint.
Quality scores come from each model’s own evaluation run with LLM-based judging (Claude Opus 4.6) — the two repos use different FLEURS samples and scoring runs, so baseline scores are not directly comparable across tables. For deployment decisions, the translation examples section above provides a more concrete side-by-side comparison of actual output quality.
Appendix: Training, Evaluation, and Hardware Details
GPU / VRAM
GPU: NVIDIA RTX PRO 6000 Blackwell Max-Q (98 GB), bfloat16
| Model | Parameters | Peak VRAM | Speed (tok/s) |
|---|---|---|---|
| Whisper EN-JA (ours) | 756M | 1.56 GB | 212.1 |
| SeamlessM4T v2 Large | 1.50B | 2.89 GB | 48.6 |
| OpenAI Whisper large-v3 | 1.55B | 3.13 GB | 51.0 |
| Qwen3-ASR EN-JA (ours) | 1.7B | ~4 GB* | 45.8 |
* Qwen3-ASR VRAM measurement pending (also runs on CPU).
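Peak VRAM figures like these can be read from PyTorch's CUDA allocator statistics. A minimal sketch, assuming the model and input features are already on the GPU as in the inference examples below:

```python
# Measure peak allocated VRAM around a single generate() call.
import torch

torch.cuda.reset_peak_memory_stats()
with torch.no_grad():
    model.generate(features, max_new_tokens=256)  # model/features as in the inference sketches
print(f"peak VRAM: {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")
```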
Qwen3-ASR EN-JA (1.7B) — Training
| Parameter | Value |
|---|---|
| Base model | Qwen3-ASR-1.7B |
| Fine-tuning method | Full-parameter SFT |
| Training data | ~1.27M paired audio-text translation samples (EN↔JA) |
| Optimizer | AdamW |
| Learning rate | 1e-5 |
| LR scheduler | Cosine with warmup (3% warmup) |
| Effective batch size | 64 (batch 8 × grad accumulation 8) |
| Training epochs | ~1.3 |
| Best checkpoint | Epoch 1.16 (by eval loss) |
| Precision | bfloat16 |
| Max audio length | 30 seconds |
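Purely as an illustration (the training scripts are not published), the hyperparameters listed above map onto transformers TrainingArguments roughly as follows; the output directory name is a placeholder.

```python
# Illustrative mapping of the hyperparameters above onto TrainingArguments;
# this is not the (unpublished) training script.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="qwen3-asr-ja-en-sft",   # placeholder
    per_device_train_batch_size=8,
    gradient_accumulation_steps=8,      # effective batch size 64
    learning_rate=1e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,                  # 3% warmup
    num_train_epochs=1.3,
    bf16=True,
)
# The best checkpoint (epoch 1.16) was selected by eval loss.
```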
Translation direction controlled via language parameter (target output language):
- language="Japanese" → EN audio → JA text
- language="English" → JA audio → EN text
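A hypothetical usage sketch follows. We have not verified the exact Python interface for this checkpoint, so the processor call and where the language argument is passed are assumptions; the model card and the GitHub inference scripts are the authoritative reference.

```python
# Hypothetical sketch only: the AutoProcessor/AutoModel interface and the exact
# placement of the `language` argument are assumptions for this custom model.
import librosa
import torch
from transformers import AutoModel, AutoProcessor

model_id = "voiceping-ai/qwen3-asr-ja-en-speech-translation"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(model_id, trust_remote_code=True, torch_dtype=torch.bfloat16)

audio, sr = librosa.load("english_speech.wav", sr=16000)  # placeholder file name
inputs = processor(audio=audio, sampling_rate=sr, return_tensors="pt")  # assumed kwargs
# `language` selects the TARGET output language (see the mapping above).
generated = model.generate(**inputs, language="Japanese")               # assumed kwarg
print(processor.batch_decode(generated, skip_special_tokens=True)[0])
```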
Whisper EN-JA (756M) — Training
| Parameter | Value |
|---|---|
| Base architecture | Whisper large-v2 (distilled) |
| Encoder layers | 32 (full, frozen during training) |
| Decoder layers | 4 (reduced from 32) |
| Hidden size | 1280 |
| Total parameters | ~756M |
| Optimizer | AdamW |
| Learning rate | 2e-4 |
| LR scheduler | Cosine with restarts |
| Batch size | 72 |
| Training epochs | 20 |
| Label smoothing | 0.1 |
| Encoder | Frozen (pre-trained representations preserved) |
| Gradient checkpointing | Enabled |
| Max audio length | 30 seconds |
Translation direction controlled via forced_decoder_ids (source audio language):
- language="en" + task="translate" → EN audio → JA text
- language="ja" + task="translate" → JA audio → EN text
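A minimal inference sketch, assuming the repository ships standard Whisper processor and config files (the audio file name and decoding settings are illustrative):

```python
# Minimal inference sketch for the distilled Whisper model.
import librosa
import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor

model_id = "voiceping-ai/whisper-ja-en-speech-translation"
processor = WhisperProcessor.from_pretrained(model_id)
model = WhisperForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16
).to("cuda")

audio, _ = librosa.load("english_speech.wav", sr=16000)  # placeholder file name
features = processor(audio, sampling_rate=16000, return_tensors="pt").input_features
features = features.to("cuda", torch.bfloat16)

# `language` is the SOURCE audio language; task="translate" produces the other language's text.
forced_ids = processor.get_decoder_prompt_ids(language="en", task="translate")  # EN audio -> JA text
pred = model.generate(features, forced_decoder_ids=forced_ids, max_new_tokens=256)
print(processor.batch_decode(pred, skip_special_tokens=True)[0])
```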
Evaluation Methodology
| Parameter | Value |
|---|---|
| Dataset | FLEURS test set (both translation directions) |
| Quality scoring | 1–5 scale (accuracy + fluency), LLM judge (Claude Opus 4.6) |
| Speed | Tokens per second on NVIDIA GPU with bfloat16 |
| Evaluation runs | Separate per repo — different FLEURS subsets, scores not directly comparable across repos |
Text normalization:
- English: BasicTextNormalizer (lowercase, remove punctuation)
- Japanese: Morphological tokenization with Kanji display-form normalization
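A sketch of this normalization step follows. The Japanese tokenizer choice (fugashi, a MeCab wrapper) and the use of NFKC normalization for Kanji display forms are assumptions; the repos only state the high-level approach.

```python
# Text normalization sketch; the Japanese tooling below is an assumption.
import unicodedata
from transformers.models.whisper.english_normalizer import BasicTextNormalizer
from fugashi import Tagger  # MeCab wrapper (assumed choice of morphological tokenizer)

en_norm = BasicTextNormalizer()  # lowercase + strip punctuation
ja_tagger = Tagger()

def normalize_en(text: str) -> str:
    return en_norm(text)

def normalize_ja(text: str) -> str:
    text = unicodedata.normalize("NFKC", text)                # unify character display forms
    return " ".join(tok.surface for tok in ja_tagger(text))   # morpheme-split for scoring
```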
Inference Speed Summary
| Model | Parameters | Speed (tok/s) | Relative |
|---|---|---|---|
| Qwen3-ASR EN-JA (ours) | 1.7B | 45.8 | 1.0x |
| Meta SeamlessM4T v2 Large | 1.50B | 48.6 | 1.1x |
| OpenAI Whisper large-v3 | 1.55B | 51.0 | 1.1x |
| Whisper EN-JA (ours) | 756M | 212.1 | 4.6x |
Qwen3-ASR inference runs on CPU (also supports GPU). Whisper EN-JA inference runs on CPU or GPU.
References
Our Models:
- Qwen3-ASR EN-JA Translation — 1.7B params (Apache 2.0)
- Whisper EN-JA Translation — 756M params, 4x faster (Apache 2.0)
Base Models:
- Qwen3-ASR-1.7B — Qwen3 automatic speech recognition model
- OpenAI Whisper large-v3 — Large-scale speech recognition and translation
- Meta SeamlessM4T v2 Large — Massively multilingual speech translation
Evaluation Dataset:
- FLEURS — Few-shot Learning Evaluation of Universal Representations of Speech


