
Benchmarking five emotional text-to-speech models for Japanese and Chinese across six target emotions, with SenseVoice emotion recognition, emotion2vec anchors, CER, naturalness, runtime, and listening examples.
Models and references:
- Qwen3-TTS CustomVoice 1.7B - custom-voice TTS with explicit emotional prompting.
- CosyVoice 300M Instruct / CosyVoice2 - instruction-style TTS baseline with named Japanese and Chinese speakers.
- Fish Audio S1-mini - expressive TTS model with inline emotion markers.
- VoxCPM2 - multilingual prompt-driven TTS model.
- IndexTTS-2 - emotional zero-shot TTS model evaluated here as an experimental Japanese/Chinese comparison.
Abstract
We benchmarked five emotional text-to-speech systems for Japanese and Chinese across six target emotions: neutral, happy, sad, angry, fear, and disgust. The evaluation uses neutral prompts so the requested emotion must come from speech style, not from emotionally loaded text. Each model generated 120 samples, for a 600-WAV main benchmark corpus across the five completed systems.
The strongest balanced candidate is Qwen3-TTS CustomVoice 1.7B: it has the best pooled SenseVoice accuracy among models with trustworthy Japanese and Chinese text output, the lowest mean CER, the best anchor hit rate, and strong NISQA-TTS naturalness. CosyVoice 300M Instruct is the naturalness leader, but emotion recognition is weak, especially in Japanese. IndexTTS-2 reaches a high pooled SenseVoice score, but its Japanese CER is too high to treat that result as reliable Japanese TTS.
The most important pattern is language and emotion imbalance: Chinese is consistently easier than Japanese in this automatic setup, while fear and disgust remain unsolved across all evaluated models.
Motivation
Emotional TTS is not just a naturalness problem. A model can sound fluent and pleasant while failing to express the requested style. For product use cases such as multilingual avatars, customer support voices, training simulations, or expressive speech translation, we need to know whether a TTS system can keep three things aligned at once:
- It says the intended Japanese or Chinese sentence.
- It sounds natural enough to listen to.
- It expresses the requested emotion rather than collapsing into neutral speech or a nearby emotion.
CLAP-style audio-text similarity is useful for broad retrieval, but it is too indirect for a six-label emotional TTS benchmark. This evaluation combines discrete emotion recognition, continuous emotion anchors, transcription correctness, naturalness predictors, runtime, and listening samples. The goal is not to declare a final production winner from automatic metrics alone; it is to screen models and identify which systems deserve human listening tests.
Evaluation Methodology
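The benchmark uses a balanced generation grid across language, emotion, and prompt text. With 120 samples per model, two languages, and six emotions, that works out to 10 prompts per language for each emotion.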
The same sentence is reused across all six emotions. This keeps the task clean: if a Japanese sentence says “The meeting starts at 10 a.m.” or a Chinese sentence says “The documents are on the desk,” the model cannot rely on emotional text content. It must express the requested emotion through speech.
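The grid can be sketched as a plain Cartesian product. This is a minimal illustration rather than the benchmark's actual generation script: `build_grid` is a hypothetical helper, and the count of 10 prompts per language is inferred from the 120-samples-per-model figure, not read from a config file.

```python
from itertools import product

LANGUAGES = ["ja", "zh"]
EMOTIONS = ["neutral", "happy", "sad", "angry", "fear", "disgust"]
PROMPTS_PER_LANGUAGE = 10  # assumption: inferred from 120 samples per model


def build_grid(languages=LANGUAGES, emotions=EMOTIONS, n_prompts=PROMPTS_PER_LANGUAGE):
    """Return one row per sample: every prompt is reused for every emotion."""
    grid = []
    for lang, emotion, idx in product(languages, emotions, range(1, n_prompts + 1)):
        grid.append({
            "language": lang,
            "emotion": emotion,
            "prompt_id": f"{lang}_{idx:03d}",  # follows the ja_001 / zh_001 ID style
        })
    return grid


grid = build_grid()
print(len(grid))      # samples per model
print(len(grid) * 5)  # samples across the five completed systems
```

Because the prompt ID does not depend on the emotion, any difference between two rows with the same `prompt_id` has to come from the emotion control, not from the text.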
Prompt Set
Example Japanese prompts:
| ID | Sentence |
|---|---|
| ja_001 | 会議は午前十時に始まります。 |
| ja_002 | 資料は机の上に置いてあります。 |
| ja_003 | 明日の予定を確認してください。 |
| ja_004 | 電車は三番線から出発します。 |
| ja_005 | 受付で名前を伝えてください。 |
Example Chinese prompts:
| ID | Sentence |
|---|---|
| zh_001 | 会议将在上午十点开始。 |
| zh_002 | 资料已经放在桌子上。 |
| zh_003 | 请确认明天的日程安排。 |
| zh_004 | 列车将从三号站台出发。 |
| zh_005 | 请在前台告知您的姓名。 |
Emotion Controls
| Target emotion | Control text |
|---|---|
| neutral | Speak in a clear, neutral, natural voice. |
| happy | Speak in a happy, warm, bright voice. |
| sad | Speak in a sad, soft, slow, gentle voice. |
| angry | Speak in an angry, tense, forceful voice. |
| fear | Speak in a fearful, tense, trembling voice. |
| disgust | Speak in a disgusted, displeased, rejecting voice. |
Each model receives the same target label and text, but the actual control interface is model-specific:
| Model | Speaker/reference input used | Emotion control |
|---|---|---|
| qwen3_tts_customvoice_1_7b | Predefined CustomVoice speaker Ryan. | Raw sentence plus natural-language control instruction. |
| cosyvoice_300m_instruct | Named built-in speaker: Japanese 日语男, Chinese 中文男. | Raw sentence plus natural-language control instruction. |
| fish_audio_s1_mini | No speaker or emotion reference WAV. | Inline marker such as (joyful), (sad), (angry), (scared), or (disgusted). |
| voxcpm2 | No prompt/reference WAV in the main run. | Control instruction wrapped inline before the text. |
| indextts-2 | Dataset-derived speaker prompt WAVs: JVNV for Japanese, CSEMOTIONS for Chinese. | Raw sentence plus text emotion conditioning through emo_text. |
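To make the model-specific interfaces concrete, here is a rough sketch of how one (emotion, sentence) pair could be turned into per-model inputs. `build_input` is a hypothetical adapter, not any model's real API. The marker spacing for Fish Audio, the handling of `neutral` without a marker, the parenthesised wrapper for VoxCPM2, and passing the control text as `emo_text` are all assumptions made for illustration.

```python
CONTROL_TEXT = {
    "neutral": "Speak in a clear, neutral, natural voice.",
    "happy": "Speak in a happy, warm, bright voice.",
    "sad": "Speak in a sad, soft, slow, gentle voice.",
    "angry": "Speak in an angry, tense, forceful voice.",
    "fear": "Speak in a fearful, tense, trembling voice.",
    "disgust": "Speak in a disgusted, displeased, rejecting voice.",
}

# Inline markers listed in the table above; neutral has no listed marker.
FISH_MARKERS = {
    "happy": "(joyful)",
    "sad": "(sad)",
    "angry": "(angry)",
    "fear": "(scared)",
    "disgust": "(disgusted)",
}


def build_input(model, emotion, sentence):
    """Return a dict of text fields for one sample (hypothetical adapter layer)."""
    if model == "fish_audio_s1_mini":
        marker = FISH_MARKERS.get(emotion)
        # Assumption: marker, one space, then the sentence; neutral is sent bare.
        return {"text": f"{marker} {sentence}" if marker else sentence}
    if model == "voxcpm2":
        # Assumption: the exact wrapper syntax is not given in this write-up.
        return {"text": f"({CONTROL_TEXT[emotion]}){sentence}"}
    if model == "indextts-2":
        # Assumption: the control text is what gets passed as emo_text.
        return {"text": sentence, "emo_text": CONTROL_TEXT[emotion]}
    # Qwen3-TTS and CosyVoice: raw sentence plus a separate instruction.
    return {"text": sentence, "instruction": CONTROL_TEXT[emotion]}


print(build_input("fish_audio_s1_mini", "happy", "会議は午前十時に始まります。"))
```

Keeping the sentence itself unchanged in every branch is the point of the design: only the control channel differs between models.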
Metrics
- SenseVoice emotion accuracy: primary automatic screen. SenseVoice predictions are mapped to the six benchmark labels; `surprised` and `unknown` count as non-matches.
- emotion2vec anchor hit and margin: secondary diagnostic using human emotional-speech anchor centroids from CSEMOTIONS for Chinese and JVNV for Japanese.
- CER: faster-whisper-large-v3 transcription against the original prompt text, used to verify that emotional expression did not break the spoken content.
- NISQA-TTS: primary naturalness diagnostic for synthesized speech.
- UTMOS: secondary quality diagnostic; useful as a warning signal, but harsher and more out-of-domain for Japanese/Chinese.
- RTF: real-time factor for synthesis speed.
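The SenseVoice accuracy rule can be sketched in a few lines. The raw label strings on the left of `LABEL_MAP` are assumptions about how predictions were normalised before scoring. Only the rule itself comes from the description above: six benchmark labels, with `surprised` and `unknown` always counted as misses.

```python
# Assumed normalised SenseVoice labels -> benchmark labels.
LABEL_MAP = {
    "neutral": "neutral",
    "happy": "happy",
    "sad": "sad",
    "angry": "angry",
    "fearful": "fear",
    "disgusted": "disgust",
    # "surprised" and "unknown" are deliberately absent: they never match.
}


def emotion_accuracy(rows):
    """rows: iterable of (target_label, raw_prediction) pairs."""
    rows = list(rows)
    if not rows:
        return 0.0
    hits = sum(1 for target, pred in rows if LABEL_MAP.get(pred) == target)
    return hits / len(rows)


example = [
    ("happy", "happy"),        # match
    ("fear", "sad"),           # wrong emotion
    ("angry", "surprised"),    # unmapped label, counts as a miss
    ("disgust", "disgusted"),  # match after mapping
]
print(emotion_accuracy(example))
```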
Results
Resource Usage
Resource metrics come from metrics/generation_runs.csv for the 600 successful generated rows. They are operational diagnostics rather than strict hardware benchmarks: GPU, VRAM, wall time, and RTF are populated for all completed rows, while CPU is not captured for server-backed adapters that run outside the sampled process tree.
| Model | Median wall time | Median RTF | Median peak VRAM | GPU util | GPU power | CPU | Median peak RSS |
|---|---|---|---|---|---|---|---|
| cosyvoice_300m_instruct | 2.26s | 0.85 | 3.96 GB | 30.3% avg / 39.0% peak | 145.0W avg / 155.6W peak | 127.8% peak; 100% coverage | 5.54 GB |
| qwen3_tts_customvoice_1_7b | 4.20s | 1.58 | 8.13 GB | 22.9% avg / 25.0% peak | 126.3W avg / 127.1W peak | 138.1% peak; 100% coverage | 6.22 GB |
| fish_audio_s1_mini | 7.06s | 3.47 | 13.05 GB | 25.3% avg / 69.0% peak | 150.4W avg / 183.7W peak | not captured; 0% coverage | 0.80 GB |
| indextts-2 | 26.39s | 6.97 | 7.29 GB | 18.2% avg / 100.0% peak | 131.3W avg / 199.6W peak | not captured; 0% coverage | 7.69 GB |
| voxcpm2 | 28.44s | 9.84 | 12.79 GB | 12.3% avg / 100.0% peak | 106.7W avg / 191.5W peak | not captured; 0% coverage | 10.65 GB |
CosyVoice is the fastest and lowest-VRAM model in this run, but it is not the strongest emotion-control candidate. Qwen3-TTS requires more VRAM than CosyVoice but remains much faster than IndexTTS-2 and VoxCPM2 while keeping the best balance of emotion recognition and text fidelity. Fish Audio has a small process RSS footprint, but its GPU memory footprint is the largest of the completed models.
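For readers less familiar with RTF, the sketch below uses the common definition: synthesis wall time divided by the duration of the generated audio, so values below 1 mean faster than real time. We assume this is the definition behind the table. The sample rows are made up for illustration and are not taken from `metrics/generation_runs.csv`.

```python
from statistics import median


def real_time_factor(wall_seconds, audio_seconds):
    """Wall-clock synthesis time divided by generated audio duration."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return wall_seconds / audio_seconds


def median_rtf(rows):
    """rows: iterable of (wall_seconds, audio_seconds) pairs, one per sample."""
    return median(real_time_factor(w, a) for w, a in rows)


# Illustrative rows only.
rows = [(2.0, 4.0), (3.0, 3.0), (6.0, 2.0)]
print(median_rtf(rows))
```

Note that the median RTF is computed per sample and then aggregated, so it generally differs from median wall time divided by median audio duration.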
JA/ZH Metrics Overview
This split table is the quickest way to compare Japanese and Chinese behavior across the three core automatic checks: SenseVoice emotion accuracy, CER text fidelity, and emotion2vec anchor alignment.
| Model | JA SenseVoice | ZH SenseVoice | JA CER | ZH CER | JA anchor hit | ZH anchor hit | JA anchor margin | ZH anchor margin |
|---|---|---|---|---|---|---|---|---|
| qwen3_tts_customvoice_1_7b | 15.0% | 53.3% | 8.6% | 9.7% | 40.0% | 64.0% | -0.06645 | 0.04480 |
| indextts-2 | 43.3% | 16.7% | 91.0% | 10.3% | 38.0% | 30.0% | -0.08293 | -0.04063 |
| voxcpm2 | 6.7% | 35.0% | 18.6% | 4.4% | 40.0% | 36.0% | -0.04479 | -0.02693 |
| cosyvoice_300m_instruct | 1.7% | 36.7% | 43.9% | 11.1% | 24.0% | 72.0% | -0.05481 | 0.03796 |
| fish_audio_s1_mini | 6.7% | 16.7% | 12.7% | 16.8% | 20.0% | 24.0% | -0.08972 | -0.09542 |
Chinese is generally easier for the automatic emotion metrics, but CER and emotion accuracy do not always move together. Qwen3-TTS keeps CER low in both languages, while IndexTTS-2 has the highest Japanese SenseVoice score and also the worst Japanese CER.
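The pooled SenseVoice figures referred to elsewhere in this write-up can be approximated from the split table. The sketch assumes each model contributes 60 Japanese and 60 Chinese samples (120 per model, balanced across languages), so pooling reduces to a sample-weighted mean of the two rounded percentages. `pooled` is a hypothetical helper.

```python
# (JA SenseVoice %, ZH SenseVoice %) copied from the table above.
SENSEVOICE = {
    "qwen3_tts_customvoice_1_7b": (15.0, 53.3),
    "indextts-2": (43.3, 16.7),
    "voxcpm2": (6.7, 35.0),
    "cosyvoice_300m_instruct": (1.7, 36.7),
    "fish_audio_s1_mini": (6.7, 16.7),
}


def pooled(ja, zh, n_ja=60, n_zh=60):
    """Sample-weighted mean; the 60/60 split is an assumption from the grid size."""
    return (ja * n_ja + zh * n_zh) / (n_ja + n_zh)


ranking = sorted(
    ((model, pooled(ja, zh)) for model, (ja, zh) in SENSEVOICE.items()),
    key=lambda item: item[1],
    reverse=True,
)
for model, score in ranking:
    print(f"{model}: {score:.2f}%")
```

Under these assumptions Qwen3-TTS ranks first and IndexTTS-2 second, which is why the IndexTTS-2 pooled score only looks competitive until the Japanese CER is taken into account.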
Text Fidelity (CER)
For text fidelity, Qwen3-TTS is the most stable JA/ZH result: Japanese CER is 8.6% and Chinese CER is 9.7%. IndexTTS-2 is the warning case. Its pooled emotion score looks competitive, but its Japanese CER reaches 91.0%, so the generated Japanese text path is not reliable enough in this setup.
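As a reminder of what CER measures, here is a minimal character-level implementation: Levenshtein distance between the reference prompt and the transcript, divided by the reference length. The text normalisation used in the benchmark is not described here, so this sketch only strips whitespace. That choice, and both helper names, are assumptions.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance over characters, using a rolling DP row."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = cur
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate; can exceed 1.0 when the transcript is much longer."""
    ref = "".join(reference.split())
    hyp = "".join(hypothesis.split())
    if not ref:
        raise ValueError("reference text is empty")
    return edit_distance(ref, hyp) / len(ref)


reference = "会議は午前十時に始まります。"
transcript = "会議は午後十時に始まります。"  # one substituted character
print(cer(reference, transcript))
```

A single wrong character in a 14-character prompt already costs about 7 percentage points, so short prompts make CER fairly sensitive to individual transcription errors.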
Emotion Accuracy
SenseVoice
Chinese is clearly easier than Japanese in this automatic setup. For Qwen3-TTS, Chinese SenseVoice accuracy is 53.3% while Japanese is 15.0%, even though CER is low in both languages. That suggests the issue is not just intelligibility; the emotional cues recognized by SenseVoice are much weaker or less aligned in Japanese.
`fear` and `disgust` are the hardest labels. SenseVoice recall is 0.0% for both emotions across all evaluated model/language pairs. These labels often collapse into `sad`, `neutral`, `angry`, or `unknown`.
Figure: SenseVoice confusion matrices. Rows are target emotions and columns are SenseVoice predictions; green boxes mark the ideal diagonal.
Compact failure-mode highlights:
| Case | What happened | Why it matters |
|---|---|---|
| indextts-2 / ja | happy -> sad 4/10; fear -> sad 5/10; disgust -> angry 10/10. | Emotion labels may look plausible even when Japanese text quality is unreliable. |
| qwen3_tts_customvoice_1_7b / zh | happy -> neutral 5/10; fear -> sad 9/10; disgust -> neutral 9/10. | Qwen is the balanced winner, but hard emotions still collapse. |
| cosyvoice_300m_instruct / ja | happy -> unknown 10/10; fear -> unknown 9/10; disgust -> unknown 8/10. | Naturalness does not guarantee recognizable emotional control. |
| fish_audio_s1_mini / zh | happy -> neutral 10/10; fear -> neutral 9/10; disgust -> neutral 8/10. | Inline emotion markers did not reliably shift the generated prosody. |
| voxcpm2 / zh | happy -> neutral 7/10; fear -> neutral 6/10; disgust -> neutral 10/10. | Prompt-driven control often collapsed into neutral speech. |
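Counts like "fear -> sad 9/10" come from tallying (target, prediction) pairs for one model and language. A minimal sketch of that tally and the per-emotion recall follows. The helper names are hypothetical, and the example rows only mirror the 9/10 pattern: the tenth prediction is a placeholder, not a real result.

```python
from collections import Counter, defaultdict


def confusion_counts(rows):
    """rows: iterable of (target, prediction). Returns target -> Counter of predictions."""
    matrix = defaultdict(Counter)
    for target, prediction in rows:
        matrix[target][prediction] += 1
    return matrix


def recall(matrix, target):
    """Share of samples with this target that were predicted as the target."""
    total = sum(matrix[target].values())
    return matrix[target][target] / total if total else 0.0


# Illustrative rows: nine "sad" predictions plus one placeholder for a fear target.
rows = [("fear", "sad")] * 9 + [("fear", "neutral")]
matrix = confusion_counts(rows)
print(matrix["fear"].most_common(1))
print(recall(matrix, "fear"))
```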
emotion2vec Anchors
The anchor metric tells a similar story to SenseVoice: Chinese anchors are more favorable than Japanese anchors. A positive margin means the generated audio is closer to the target emotion centroid than to the nearest non-target centroid. Qwen3-TTS and CosyVoice both have positive Chinese margins, while every Japanese margin is negative.
Unlike SenseVoice, the anchor diagnostic is a centroid-similarity check rather than a label classifier, so the useful visual is the hit/margin split rather than a confusion matrix.
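The hit and margin definitions can be written down directly. The sketch assumes cosine similarity between an utterance embedding and per-emotion centroids, and treats a hit as a strictly positive margin. The similarity function actually used with emotion2vec is not stated here, and the 2-D vectors are toy values rather than real embeddings.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def anchor_margin(embedding, centroids, target):
    """Similarity to the target centroid minus the best non-target similarity."""
    sims = {label: cosine(embedding, c) for label, c in centroids.items()}
    best_other = max(s for label, s in sims.items() if label != target)
    return sims[target] - best_other


def anchor_hit(embedding, centroids, target):
    """Assumed hit rule: the target centroid is strictly the closest one."""
    return anchor_margin(embedding, centroids, target) > 0


# Toy 2-D centroids and embedding, for illustration only.
centroids = {"happy": [1.0, 0.0], "sad": [0.0, 1.0]}
embedding = [3.0, 4.0]
print(anchor_margin(embedding, centroids, "sad"))
print(anchor_hit(embedding, centroids, "happy"))
```

Averaging the margin over samples gives a signed score like the ones in the JA/ZH table, where a negative mean indicates that a non-target centroid is usually closer.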
Naturalness
| Model | Mean NISQA-TTS | Low NISQA-TTS <3.0 | Mean UTMOS | Low UTMOS <3.0 |
|---|---|---|---|---|
| cosyvoice_300m_instruct | 4.267 | 0.0% | 3.282 | 20.8% |
| indextts-2 | 4.063 | 11.7% | 2.078 | 93.3% |
| qwen3_tts_customvoice_1_7b | 4.007 | 0.8% | 2.939 | 51.7% |
| fish_audio_s1_mini | 3.935 | 3.3% | 2.932 | 55.8% |
| voxcpm2 | 3.788 | 8.3% | 2.596 | 76.7% |
Naturalness and emotional correctness are different questions. CosyVoice is the clearest naturalness winner, but it is not the emotion-control winner. Qwen3-TTS is slightly behind CosyVoice on NISQA-TTS, but substantially better on the balanced emotion/intelligibility trade-off.
Listening Examples
The table below uses the same prompt index for happy and angry in Japanese and Chinese. These clips are not a human listening test; they are qualitative anchors for the automatic metrics.
| Model | Language | Target | SenseVoice prediction | Sample |
|---|---|---|---|---|
| qwen3_tts_customvoice_1_7b | JA | happy | unknown | |
| qwen3_tts_customvoice_1_7b | JA | angry | angry | |
| qwen3_tts_customvoice_1_7b | ZH | happy | neutral | |
| qwen3_tts_customvoice_1_7b | ZH | angry | angry | |
| cosyvoice_300m_instruct | JA | happy | unknown | |
| cosyvoice_300m_instruct | JA | angry | unknown | |
| cosyvoice_300m_instruct | ZH | happy | happy | |
| cosyvoice_300m_instruct | ZH | angry | neutral | |
| indextts-2 | JA | happy | sad | |
| indextts-2 | JA | angry | surprised | |
| indextts-2 | ZH | happy | neutral | |
| indextts-2 | ZH | angry | neutral | |
| fish_audio_s1_mini | JA | happy | happy | |
| fish_audio_s1_mini | JA | angry | happy | |
| fish_audio_s1_mini | ZH | happy | neutral | |
| fish_audio_s1_mini | ZH | angry | neutral | |
| voxcpm2 | JA | happy | unknown | |
| voxcpm2 | JA | angry | angry | |
| voxcpm2 | ZH | happy | happy | |
| voxcpm2 | ZH | angry | angry | |
Limitations
- Automatic emotion labels are not human judgment. SenseVoice is useful because it supports Japanese and Chinese and emits labels that map to the benchmark, but it can have classifier bias and language imbalance.
- Anchor metrics depend on the anchor datasets. Japanese anchors come from JVNV and Chinese anchors from CSEMOTIONS; `ja/neutral` and `zh/disgust` anchors were missing in this run.
- IndexTTS-2 Japanese is diagnostic, not production evidence. Its pooled emotion score looks strong, but Japanese CER is too high in this setup.
Further Research
- Run a small native-listener MOS/CMOS test for Qwen3-TTS and CosyVoice, with separate ratings for naturalness, emotion correctness, and text intelligibility.
- Treat IndexTTS-2 as Chinese-only for now, or rerun it after fixing the Japanese tokenizer/text path.
- Add or curate missing `ja/neutral` and `zh/disgust` emotion anchors.
- Run a focused Chinese human check for `sad`, `angry`, `fear`, and `disgust`, where automatic metrics show strong differences between easy and hard labels.
- Keep SenseVoice as an automatic screening metric, but make final production decisions with human listening tests.
Conclusion
For Japanese and Chinese emotional TTS, Qwen3-TTS CustomVoice 1.7B is the strongest balanced model in this benchmark. It does not solve every emotion, but it combines the best practical mix of emotion recognition, low CER, anchor hit rate, naturalness, and runtime.
CosyVoice 300M Instruct is the naturalness leader and remains worth testing in human listening studies, but it should not be treated as solved six-emotion control. IndexTTS-2 is diagnostically interesting, especially for Chinese, but the Japanese results should not be trusted until the text path is fixed.
The biggest open problem is not raw naturalness. It is reliable, language-consistent emotion control. Chinese is easier than Japanese in this setup, and fear and disgust remain open problems across the evaluated models.


