Japanese and Chinese Emotional TTS Benchmark

Models and references:

Qwen3-TTS CustomVoice 1.7B - custom-voice TTS with explicit emotional prompting.
CosyVoice 300M Instruct / CosyVoice2 - instruction-style TTS baseline with named Japanese and Chinese speakers.
Fish Audio S1-mini - expressive TTS model with inline emotion markers.
VoxCPM2 - multilingual prompt-driven TTS model.
IndexTTS-2 - emotional zero-shot TTS model evaluated here as an experimental Japanese/Chinese comparison.

Abstract

We benchmarked five emotional text-to-speech systems for Japanese and Chinese across six target emotions: neutral, happy, sad, angry, fear, and disgust. The evaluation uses neutral prompts so the requested emotion must come from speech style, not from emotionally loaded text. Each model generated 120 samples, for a 600-WAV main benchmark corpus across the five completed systems.

The strongest balanced candidate is Qwen3-TTS CustomVoice 1.7B: it has the best pooled SenseVoice accuracy among models with trustworthy Japanese and Chinese text output, the lowest mean CER, the best anchor hit rate, and strong NISQA-TTS naturalness. CosyVoice 300M Instruct is the naturalness leader, but emotion recognition is weak, especially in Japanese. IndexTTS-2 reaches a high pooled SenseVoice score, but its Japanese CER is too high to treat that result as reliable Japanese TTS.

The most important pattern is language and emotion imbalance: Chinese is consistently easier than Japanese in this automatic setup, while fear and disgust remain unsolved, with 0.0% SenseVoice recall for every evaluated model/language pair.

Motivation

Emotional TTS is not just a naturalness problem. A model can sound fluent and pleasant while failing to express the requested style. For product use cases such as multilingual avatars, customer support voices, training simulations, or expressive speech translation, we need to know whether a TTS system can keep three things aligned at once:

It says the intended Japanese or Chinese sentence.
It sounds natural enough to listen to.
It expresses the requested emotion rather than collapsing into neutral speech or a nearby emotion.

CLAP-style audio-text similarity is useful for broad retrieval, but it is too indirect for a six-label emotional TTS benchmark. This evaluation combines discrete emotion recognition, continuous emotion anchors, transcription correctness, naturalness predictors, runtime, and listening samples. The goal is not to declare a final production winner from automatic metrics alone; it is to screen models and identify which systems deserve human listening tests.

Evaluation Methodology

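The benchmark uses a balanced generation grid across language, emotion, and prompt text: 2 languages × 6 emotions × 10 samples per language/emotion cell, for 120 generations per model.

Figure: Experiment design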

The same sentence is reused across all six emotions. This keeps the task clean: if a Japanese sentence says “The meeting starts at 10 a.m.” or a Chinese sentence says “The documents are on the desk,” the model cannot rely on emotional text content. It must express the requested emotion through speech.
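The grid described above can be sketched as a simple Cartesian product. This is a minimal illustration, not the benchmark's actual job builder: the function name `build_grid`, the job dictionary keys, and the assumption that each cell is filled by ten distinct prompt indices (rather than, say, repeated seeds) are all hypothetical. Only the totals come from the text.

```python
from itertools import product

LANGUAGES = ["ja", "zh"]
EMOTIONS = ["neutral", "happy", "sad", "angry", "fear", "disgust"]
# 120 samples per model / (2 languages * 6 emotions) = 10 samples per cell.
SAMPLES_PER_CELL = 10


def build_grid(languages=LANGUAGES, emotions=EMOTIONS, n=SAMPLES_PER_CELL):
    """Return one job per (language, emotion, prompt index).

    The same prompt_id appears under every emotion, so the text itself
    never carries the emotional signal.
    """
    return [
        {"lang": lang, "emotion": emo, "prompt_id": f"{lang}_{i:03d}"}
        for lang, emo, i in product(languages, emotions, range(1, n + 1))
    ]


grid = build_grid()
print(len(grid))      # jobs per model
print(len(grid) * 5)  # jobs across the five completed systems
```

Reusing each `prompt_id` under all six emotions is what makes per-emotion comparisons fair: any difference between two clips of the same prompt has to come from the speech style.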

Prompt Set

Example Japanese prompts:

ID	Sentence
`ja_001`	会議は午前十時に始まります。
`ja_002`	資料は机の上に置いてあります。
`ja_003`	明日の予定を確認してください。
`ja_004`	電車は三番線から出発します。
`ja_005`	受付で名前を伝えてください。

Example Chinese prompts:

ID	Sentence
`zh_001`	会议将在上午十点开始。
`zh_002`	资料已经放在桌子上。
`zh_003`	请确认明天的日程安排。
`zh_004`	列车将从三号站台出发。
`zh_005`	请在前台告知您的姓名。

Emotion Controls

Target emotion	Control text
`neutral`	Speak in a clear, neutral, natural voice.
`happy`	Speak in a happy, warm, bright voice.
`sad`	Speak in a sad, soft, slow, gentle voice.
`angry`	Speak in an angry, tense, forceful voice.
`fear`	Speak in a fearful, tense, trembling voice.
`disgust`	Speak in a disgusted, displeased, rejecting voice.

Each model receives the same target label and text, but the actual control interface is model-specific:

Model	Speaker/reference input used	Emotion control
`qwen3_tts_customvoice_1_7b`	Predefined CustomVoice speaker `Ryan`.	Raw sentence plus natural-language control instruction.
`cosyvoice_300m_instruct`	Named built-in speaker: Japanese `日语男`, Chinese `中文男`.	Raw sentence plus natural-language control instruction.
`fish_audio_s1_mini`	No speaker or emotion reference WAV.	Inline marker such as `(joyful)`, `(sad)`, `(angry)`, `(scared)`, or `(disgusted)`.
`voxcpm2`	No prompt/reference WAV in the main run.	Control instruction wrapped inline before the text.
`indextts-2`	Dataset-derived speaker prompt WAVs: JVNV for Japanese, CSEMOTIONS for Chinese.	Raw sentence plus text emotion conditioning through `emo_text`.
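To make the interface differences concrete, here is a hedged sketch of how one target label and sentence could be turned into a per-model request. It builds plain dictionaries only and calls no real TTS library. The `instruct` key and the parenthesis wrapper used for VoxCPM2 are placeholders, since the exact wrapper format is not given here, and sending neutral text to Fish Audio without a marker is an assumption. The control texts, the Fish Audio markers, and the `emo_text` field are taken from the tables above.

```python
CONTROL_TEXT = {
    "neutral": "Speak in a clear, neutral, natural voice.",
    "happy": "Speak in a happy, warm, bright voice.",
    "sad": "Speak in a sad, soft, slow, gentle voice.",
    "angry": "Speak in an angry, tense, forceful voice.",
    "fear": "Speak in a fearful, tense, trembling voice.",
    "disgust": "Speak in a disgusted, displeased, rejecting voice.",
}

# Markers listed in the table. No neutral marker is listed, so this sketch
# assumes neutral text is sent without one.
FISH_MARKERS = {
    "happy": "(joyful)",
    "sad": "(sad)",
    "angry": "(angry)",
    "fear": "(scared)",
    "disgust": "(disgusted)",
}


def build_request(model: str, text: str, emotion: str) -> dict:
    """Return a hypothetical request payload for one generation job."""
    instruction = CONTROL_TEXT[emotion]
    if model == "fish_audio_s1_mini":
        # Inline marker placed directly before the sentence.
        return {"text": FISH_MARKERS.get(emotion, "") + text}
    if model == "voxcpm2":
        # Assumption: parentheses stand in for the real inline wrapper.
        return {"text": f"({instruction}){text}"}
    if model == "indextts-2":
        # Raw sentence plus text emotion conditioning through emo_text.
        return {"text": text, "emo_text": instruction}
    if model in ("qwen3_tts_customvoice_1_7b", "cosyvoice_300m_instruct"):
        # "instruct" is a placeholder key, not a documented parameter name.
        return {"text": text, "instruct": instruction}
    raise ValueError(f"unknown model: {model}")


print(build_request("fish_audio_s1_mini", "会议将在上午十点开始。", "happy"))
```

Keeping the raw sentence separate from the control text wherever the model allows it makes the CER check cleaner, because the transcription reference is then exactly the prompt text.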

Metrics

SenseVoice emotion accuracy: primary automatic screen. SenseVoice predictions are mapped to the six benchmark labels; surprised and unknown count as non-matches.
emotion2vec anchor hit and margin: secondary diagnostic using human emotional-speech anchor centroids from CSEMOTIONS for Chinese and JVNV for Japanese.
CER: faster-whisper-large-v3 transcription against the original prompt text, used to verify that emotional expression did not break the spoken content.
NISQA-TTS: primary naturalness diagnostic for synthesized speech.
UTMOS: secondary quality diagnostic; useful as a warning signal, but harsher and more out-of-domain for Japanese/Chinese.
RTF: real-time factor for synthesis speed.
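Two of these metrics are simple enough to sketch directly. The code below is an illustrative pure-Python version of character error rate and real-time factor, not the benchmark's scoring code. The normalization step (dropping punctuation and whitespace) is an assumption, and RTF is taken in its usual sense of synthesis time divided by audio duration, so values above 1.0 mean slower than real time.

```python
import unicodedata


def normalize(text: str) -> str:
    """Drop punctuation and whitespace before comparing (assumed normalization)."""
    return "".join(
        ch for ch in text
        if not ch.isspace() and unicodedata.category(ch)[0] not in ("P", "Z")
    )


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance over characters, using a rolling row."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edits needed divided by reference length."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference is empty after normalization")
    return edit_distance(ref, hyp) / len(ref)


def rtf(synthesis_seconds: float, audio_seconds: float) -> float:
    """Real-time factor: time spent generating per second of audio."""
    return synthesis_seconds / audio_seconds


prompt = "会議は午前十時に始まります。"
print(cer(prompt, "会議は午前十時に始まります"))   # only punctuation differs
print(cer(prompt, "会議は午前10時に始まります"))  # numeral written as digits
```

The second call shows why text normalization matters for Japanese and Chinese: a transcript that writes 十 as 10 is acoustically correct but still costs two character edits unless numerals are normalized before scoring.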

Results

Resource Usage

Resource metrics come from `metrics/generation_runs.csv` for the 600 successful generated rows. They are operational diagnostics rather than strict hardware benchmarks: GPU, VRAM, wall time, and RTF are populated for all completed rows, while CPU is not captured for server-backed adapters that run outside the sampled process tree.

Model	Median wall time	Median RTF	Median peak VRAM	GPU util	GPU power	CPU	Median peak RSS
`cosyvoice_300m_instruct`	2.26s	0.85	3.96 GB	30.3% avg / 39.0% peak	145.0W avg / 155.6W peak	127.8% peak; 100% coverage	5.54 GB
`qwen3_tts_customvoice_1_7b`	4.20s	1.58	8.13 GB	22.9% avg / 25.0% peak	126.3W avg / 127.1W peak	138.1% peak; 100% coverage	6.22 GB
`fish_audio_s1_mini`	7.06s	3.47	13.05 GB	25.3% avg / 69.0% peak	150.4W avg / 183.7W peak	not captured; 0% coverage	0.80 GB
`indextts-2`	26.39s	6.97	7.29 GB	18.2% avg / 100.0% peak	131.3W avg / 199.6W peak	not captured; 0% coverage	7.69 GB
`voxcpm2`	28.44s	9.84	12.79 GB	12.3% avg / 100.0% peak	106.7W avg / 191.5W peak	not captured; 0% coverage	10.65 GB

CosyVoice is the fastest and lowest-VRAM model in this run, but it is not the strongest emotion-control candidate. Qwen3-TTS requires more VRAM than CosyVoice but remains much faster than IndexTTS-2 and VoxCPM2 while keeping the best balance of emotion recognition and text fidelity. Fish Audio has a small process RSS footprint, but its GPU memory footprint is the largest of the completed models.

JA/ZH Metrics Overview

This split table is the quickest way to compare Japanese and Chinese behavior across the three core automatic checks: SenseVoice emotion accuracy, CER text fidelity, and emotion2vec anchor alignment.

Model	JA SenseVoice	ZH SenseVoice	JA CER	ZH CER	JA anchor hit	ZH anchor hit	JA anchor margin	ZH anchor margin
`qwen3_tts_customvoice_1_7b`	15.0%	53.3%	8.6%	9.7%	40.0%	64.0%	-0.06645	0.04480
`indextts-2`	43.3%	16.7%	91.0%	10.3%	38.0%	30.0%	-0.08293	-0.04063
`voxcpm2`	6.7%	35.0%	18.6%	4.4%	40.0%	36.0%	-0.04479	-0.02693
`cosyvoice_300m_instruct`	1.7%	36.7%	43.9%	11.1%	24.0%	72.0%	-0.05481	0.03796
`fish_audio_s1_mini`	6.7%	16.7%	12.7%	16.8%	20.0%	24.0%	-0.08972	-0.09542

Chinese is generally easier for the automatic emotion metrics, but CER and emotion accuracy do not always move together. Qwen3-TTS keeps CER low in both languages, while IndexTTS-2 has the highest Japanese SenseVoice score and also the worst Japanese CER.
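The pooled figures referred to elsewhere in this report can be re-derived from the split table. The sketch below assumes equal weighting of the two languages, which follows from the balanced grid (60 samples per language per model). The dictionary layout and key names are illustrative, and the pooled values are recomputed here rather than quoted from the benchmark's own output.

```python
# Values copied from the JA/ZH table above (percentages).
SPLIT = {
    "qwen3_tts_customvoice_1_7b": {"ja_sv": 15.0, "zh_sv": 53.3, "ja_cer": 8.6, "zh_cer": 9.7},
    "indextts-2": {"ja_sv": 43.3, "zh_sv": 16.7, "ja_cer": 91.0, "zh_cer": 10.3},
    "voxcpm2": {"ja_sv": 6.7, "zh_sv": 35.0, "ja_cer": 18.6, "zh_cer": 4.4},
    "cosyvoice_300m_instruct": {"ja_sv": 1.7, "zh_sv": 36.7, "ja_cer": 43.9, "zh_cer": 11.1},
    "fish_audio_s1_mini": {"ja_sv": 6.7, "zh_sv": 16.7, "ja_cer": 12.7, "zh_cer": 16.8},
}


def pooled(ja: float, zh: float) -> float:
    """Equal-weight mean; valid because both languages have the same sample count."""
    return (ja + zh) / 2


summary = {
    model: {
        "pooled_sv": pooled(row["ja_sv"], row["zh_sv"]),
        "mean_cer": pooled(row["ja_cer"], row["zh_cer"]),
    }
    for model, row in SPLIT.items()
}

for model, s in sorted(summary.items(), key=lambda kv: -kv[1]["pooled_sv"]):
    print(f"{model:30s} pooled SenseVoice {s['pooled_sv']:5.2f}%  mean CER {s['mean_cer']:5.2f}%")
```

On these numbers Qwen3-TTS comes out first on pooled SenseVoice accuracy and lowest on mean CER, while IndexTTS-2 ranks second on pooled accuracy only because of a Japanese score that its 91.0% Japanese CER undermines.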

Text Fidelity (CER)

Figure: CER by language

For text fidelity, Qwen3-TTS is the most stable JA/ZH result: Japanese CER is 8.6% and Chinese CER is 9.7%. IndexTTS-2 is the warning case. Its pooled emotion score looks competitive, but its Japanese CER reaches 91.0%, so in this setup its Japanese audio cannot be trusted to contain the intended sentence.

Emotion Accuracy

SenseVoice

Figure: SenseVoice accuracy by language

Chinese is clearly easier than Japanese in this automatic setup. For Qwen3-TTS, Chinese SenseVoice accuracy is 53.3% while Japanese is 15.0%, even though CER is low in both languages. That suggests the issue is not just intelligibility; the emotional cues recognized by SenseVoice are much weaker or less aligned in Japanese.

Figure: Per-emotion SenseVoice recall by model and language

fear and disgust are the hardest labels. SenseVoice recall is 0.0% for both emotions across all evaluated model/language pairs. These labels often collapse into sad, neutral, angry, or unknown.

Rows are target emotions and columns are SenseVoice predictions. Green boxes mark the ideal diagonal.
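For readers who want to reproduce this kind of matrix, here is a minimal sketch that counts (target, prediction) pairs and derives per-emotion recall. It assumes predictions have already been mapped to lowercase label strings, with `surprised` and `unknown` kept as extra columns that never match a target. The `disgust` rows reuse the `indextts-2 / ja` cell reported in this section (`disgust` -> `angry` 10/10); the `angry` rows are hypothetical and exist only to show a non-zero recall.

```python
from collections import Counter, defaultdict

BENCHMARK_LABELS = ["neutral", "happy", "sad", "angry", "fear", "disgust"]


def confusion(pairs):
    """Count predictions per target: matrix[target][prediction] -> n."""
    matrix = defaultdict(Counter)
    for target, prediction in pairs:
        matrix[target][prediction] += 1
    return matrix


def recall(matrix, label):
    """Share of samples with this target that were predicted as this label."""
    total = sum(matrix[label].values())
    return matrix[label][label] / total if total else float("nan")


pairs = (
    [("disgust", "angry")] * 10   # reported cell: indextts-2 / ja
    + [("angry", "angry")] * 7    # hypothetical illustration
    + [("angry", "neutral")] * 2  # hypothetical illustration
    + [("angry", "unknown")] * 1  # hypothetical; unknown never matches
)

matrix = confusion(pairs)
for label in ("angry", "disgust"):
    print(label, dict(matrix[label]), f"recall={recall(matrix, label):.1%}")
```

Overall accuracy is the sum of the diagonal divided by the number of samples, so a label with 0.0% recall caps a model's six-emotion accuracy at five sixths even if every other emotion is perfect.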

Figure: Japanese SenseVoice confusion matrices

Figure: Chinese SenseVoice confusion matrices

Compact failure-mode highlights:

Case	What happened	Why it matters
`indextts-2 / ja`	`happy` -> `sad` 4/10; `fear` -> `sad` 5/10; `disgust` -> `angry` 10/10.	Emotion labels may look plausible even when Japanese text quality is unreliable.
`qwen3_tts_customvoice_1_7b / zh`	`happy` -> `neutral` 5/10; `fear` -> `sad` 9/10; `disgust` -> `neutral` 9/10.	Qwen is the balanced winner, but hard emotions still collapse.
`cosyvoice_300m_instruct / ja`	`happy` -> `unknown` 10/10; `fear` -> `unknown` 9/10; `disgust` -> `unknown` 8/10.	Naturalness does not guarantee recognizable emotional control.
`fish_audio_s1_mini / zh`	`happy` -> `neutral` 10/10; `fear` -> `neutral` 9/10; `disgust` -> `neutral` 8/10.	Inline emotion markers did not reliably shift the generated prosody.
`voxcpm2 / zh`	`happy` -> `neutral` 7/10; `fear` -> `neutral` 6/10; `disgust` -> `neutral` 10/10.	Prompt-driven control often collapsed into neutral speech.

emotion2vec Anchors

Figure: emotion2vec anchor hit and margin by language

The anchor metric tells a similar story to SenseVoice: Chinese anchors are more favorable than Japanese anchors. A positive margin means the generated audio is closer to the target emotion centroid than to the nearest non-target centroid. Only Qwen3-TTS (0.04480) and CosyVoice (0.03796) have positive Chinese margins, while every Japanese margin is negative.

Unlike SenseVoice, the anchor diagnostic is a centroid-similarity check rather than a label classifier, so the useful visual is the hit/margin split rather than a confusion matrix.
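The hit and margin definitions can be sketched in a few lines. This is an illustration under stated assumptions: similarity is taken to be cosine similarity, a hit is taken to mean a positive margin, and samples whose target has no centroid (as with the missing ja/neutral and zh/disgust anchors) are skipped. The two-dimensional vectors are toy values, not emotion2vec embeddings.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def anchor_margin(embedding, centroids, target):
    """Return (margin, hit), or None when the target has no anchor centroid.

    margin = similarity to the target centroid minus the best similarity
    to any non-target centroid; hit is True when the margin is positive.
    """
    if target not in centroids:
        return None
    sims = {label: cosine(embedding, c) for label, c in centroids.items()}
    best_other = max(s for label, s in sims.items() if label != target)
    margin = sims[target] - best_other
    return margin, margin > 0


# Toy centroids and one toy clip embedding.
centroids = {"happy": [1.0, 0.0], "sad": [0.0, 1.0], "angry": [-1.0, 0.0]}
clip = [0.8, 0.6]

print(anchor_margin(clip, centroids, "happy"))    # closer to happy than to sad
print(anchor_margin(clip, centroids, "sad"))      # happy centroid is closer
print(anchor_margin(clip, centroids, "disgust"))  # no anchor available
```

Because the margin compares the target only against its strongest competitor, a model can have a reasonable hit rate and still a negative mean margin when its misses are large, which is the pattern in the Japanese columns.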

Naturalness

Naturalness diagnostics by model

Model	Mean NISQA-TTS	Low NISQA-TTS <3.0	Mean UTMOS	Low UTMOS <3.0
`cosyvoice_300m_instruct`	4.267	0.0%	3.282	20.8%
`indextts-2`	4.063	11.7%	2.078	93.3%
`qwen3_tts_customvoice_1_7b`	4.007	0.8%	2.939	51.7%
`fish_audio_s1_mini`	3.935	3.3%	2.932	55.8%
`voxcpm2`	3.788	8.3%	2.596	76.7%

Naturalness and emotional correctness are different questions. CosyVoice is the clearest naturalness winner, but it is not the emotion-control winner. Qwen3-TTS is slightly behind CosyVoice on NISQA-TTS, but substantially better on the balanced emotion/intelligibility trade-off.

Listening Examples

The table below uses the same prompt index for happy and angry in Japanese and Chinese. These clips are not a human listening test; they are qualitative anchors for the automatic metrics.

Model	Language	Target	SenseVoice prediction
`qwen3_tts_customvoice_1_7b`	JA	happy	unknown
`qwen3_tts_customvoice_1_7b`	JA	angry	angry
`qwen3_tts_customvoice_1_7b`	ZH	happy	neutral
`qwen3_tts_customvoice_1_7b`	ZH	angry	angry
`cosyvoice_300m_instruct`	JA	happy	unknown
`cosyvoice_300m_instruct`	JA	angry	unknown
`cosyvoice_300m_instruct`	ZH	happy	happy
`cosyvoice_300m_instruct`	ZH	angry	neutral
`indextts-2`	JA	happy	sad
`indextts-2`	JA	angry	surprised
`indextts-2`	ZH	happy	neutral
`indextts-2`	ZH	angry	neutral
`fish_audio_s1_mini`	JA	happy	happy
`fish_audio_s1_mini`	JA	angry	happy
`fish_audio_s1_mini`	ZH	happy	neutral
`fish_audio_s1_mini`	ZH	angry	neutral
`voxcpm2`	JA	happy	unknown
`voxcpm2`	JA	angry	angry
`voxcpm2`	ZH	happy	happy
`voxcpm2`	ZH	angry	angry

Limitations

Automatic emotion labels are not human judgment. SenseVoice is useful because it supports Japanese and Chinese and emits labels that map to the benchmark, but it can have classifier bias and language imbalance.
Anchor metrics depend on the anchor datasets. Japanese anchors come from JVNV and Chinese anchors from CSEMOTIONS; ja/neutral and zh/disgust anchors were missing in this run.
IndexTTS-2 Japanese is diagnostic, not production evidence. Its pooled emotion score looks strong, but Japanese CER is too high in this setup.

Further Research

Run a small native-listener MOS/CMOS test for Qwen3-TTS and CosyVoice, with separate ratings for naturalness, emotion correctness, and text intelligibility.
Treat IndexTTS-2 as Chinese-only for now, or rerun it after fixing the Japanese tokenizer/text path.
Add or curate missing ja/neutral and zh/disgust emotion anchors.
Run a focused Chinese human check for sad, angry, fear, and disgust, where automatic metrics show strong differences between easy and hard labels.
Keep SenseVoice as an automatic screening metric, but make final production decisions with human listening tests.

Conclusion

For Japanese and Chinese emotional TTS, Qwen3-TTS CustomVoice 1.7B is the strongest balanced model in this benchmark. It does not solve every emotion, but it offers the best practical mix of emotion recognition, low CER, anchor hit rate, naturalness, and runtime.

CosyVoice 300M Instruct is the naturalness leader and remains worth testing in human listening studies, but it should not be treated as solved six-emotion control. IndexTTS-2 is diagnostically interesting, especially for Chinese, but the Japanese results should not be trusted until the text path is fixed.

The biggest open problem is not raw naturalness. It is reliable, language-consistent emotion control. Chinese is easier than Japanese in this setup, and fear and disgust remain open problems across the evaluated models.
