Emotional TTS Benchmark: Qwen3-TTS, CosyVoice, IndexTTS-2, Fish Audio, and VoxCPM for Japanese and Chinese

VoicePing Research

Benchmarking five emotional text-to-speech models for Japanese and Chinese across six target emotions, with SenseVoice emotion recognition, emotion2vec anchors, CER, naturalness, runtime, and listening examples.

Abstract

We benchmarked five emotional text-to-speech systems for Japanese and Chinese across six target emotions: neutral, happy, sad, angry, fear, and disgust. The evaluation uses neutral prompts so the requested emotion must come from speech style, not from emotionally loaded text. Each model generated 120 samples, for a 600-WAV main benchmark corpus across the five completed systems.

The strongest balanced candidate is Qwen3-TTS CustomVoice 1.7B: it has the best pooled SenseVoice accuracy among models with trustworthy Japanese and Chinese text output, the lowest mean CER, the best anchor hit rate, and strong NISQA-TTS naturalness. CosyVoice 300M Instruct is the naturalness leader, but emotion recognition is weak, especially in Japanese. IndexTTS-2 reaches a high pooled SenseVoice score, but its Japanese CER is too high to treat that result as reliable Japanese TTS.

The most important pattern is language and emotion imbalance: Chinese is consistently easier than Japanese in this automatic setup, while fear and disgust remain unsolved across all evaluated models.

Motivation

Emotional TTS is not just a naturalness problem. A model can sound fluent and pleasant while failing to express the requested style. For product use cases such as multilingual avatars, customer support voices, training simulations, or expressive speech translation, we need to know whether a TTS system can keep three things aligned at once:

  • It says the intended Japanese or Chinese sentence.
  • It sounds natural enough to listen to.
  • It expresses the requested emotion rather than collapsing into neutral speech or a nearby emotion.

CLAP-style audio-text similarity is useful for broad retrieval, but it is too indirect for a six-label emotional TTS benchmark. This evaluation combines discrete emotion recognition, continuous emotion anchors, transcription correctness, naturalness predictors, runtime, and listening samples. The goal is not to declare a final production winner from automatic metrics alone; it is to screen models and identify which systems deserve human listening tests.

Evaluation Methodology

The benchmark uses a balanced generation grid across language, emotion, and prompt text:

[Figure: Experiment design]

The same sentence is reused across all six emotions. This keeps the task clean: if a Japanese sentence says “The meeting starts at 10 a.m.” or a Chinese sentence says “The documents are on the desk,” the model cannot rely on emotional text content. It must express the requested emotion through speech.
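
As a rough sketch of this grid, the snippet below enumerates the generation jobs for one model. The count of ten prompts per language is an assumption inferred from the 120-samples-per-model figure (2 languages × 6 emotions × 10 prompts); `build_grid` is a hypothetical helper name, and the prompt IDs follow the `ja_001` pattern used in the prompt set.

```python
from itertools import product

LANGUAGES = ["ja", "zh"]
EMOTIONS = ["neutral", "happy", "sad", "angry", "fear", "disgust"]
PROMPTS_PER_LANGUAGE = 10  # assumption: 120 samples / (2 languages * 6 emotions)


def build_grid():
    """Enumerate (language, emotion, prompt) jobs for a single model."""
    jobs = []
    prompt_indices = range(1, PROMPTS_PER_LANGUAGE + 1)
    for lang, emotion, idx in product(LANGUAGES, EMOTIONS, prompt_indices):
        jobs.append(
            {"language": lang, "emotion": emotion, "prompt_id": f"{lang}_{idx:03d}"}
        )
    return jobs


grid = build_grid()
print(len(grid))      # jobs per model
print(len(grid) * 5)  # jobs across the five completed systems
```

Because every prompt appears once under each emotion, any difference between two clips of the same prompt can only come from the emotion control, not from the text.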

Prompt Set

Example Japanese prompts:

| ID | Sentence |
|---|---|
| ja_001 | 会議は午前十時に始まります。 |
| ja_002 | 資料は机の上に置いてあります。 |
| ja_003 | 明日の予定を確認してください。 |
| ja_004 | 電車は三番線から出発します。 |
| ja_005 | 受付で名前を伝えてください。 |

Example Chinese prompts:

| ID | Sentence |
|---|---|
| zh_001 | 会议将在上午十点开始。 |
| zh_002 | 资料已经放在桌子上。 |
| zh_003 | 请确认明天的日程安排。 |
| zh_004 | 列车将从三号站台出发。 |
| zh_005 | 请在前台告知您的姓名。 |

Emotion Controls

| Target emotion | Control text |
|---|---|
| neutral | Speak in a clear, neutral, natural voice. |
| happy | Speak in a happy, warm, bright voice. |
| sad | Speak in a sad, soft, slow, gentle voice. |
| angry | Speak in an angry, tense, forceful voice. |
| fear | Speak in a fearful, tense, trembling voice. |
| disgust | Speak in a disgusted, displeased, rejecting voice. |

Each model receives the same target label and text, but the actual control interface is model-specific:

| Model | Speaker/reference input used | Emotion control |
|---|---|---|
| qwen3_tts_customvoice_1_7b | Predefined CustomVoice speaker Ryan. | Raw sentence plus natural-language control instruction. |
| cosyvoice_300m_instruct | Named built-in speaker: Japanese 日语男, Chinese 中文男. | Raw sentence plus natural-language control instruction. |
| fish_audio_s1_mini | No speaker or emotion reference WAV. | Inline marker such as (joyful), (sad), (angry), (scared), or (disgusted). |
| voxcpm2 | No prompt/reference WAV in the main run. | Control instruction wrapped inline before the text. |
| indextts-2 | Dataset-derived speaker prompt WAVs: JVNV for Japanese, CSEMOTIONS for Chinese. | Raw sentence plus text emotion conditioning through emo_text. |
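
To make the interface differences concrete, here is a minimal sketch of how one target label and sentence could be turned into model-specific inputs. It does not call any real TTS API. The `build_request` helper, the `instruction` key, the marker-to-emotion mapping, the absence of a neutral marker, and the parenthesised wrapper for VoxCPM are all assumptions made for illustration; only the control texts, the marker strings, and the `emo_text` field name come from the tables above.

```python
CONTROL_TEXT = {
    "neutral": "Speak in a clear, neutral, natural voice.",
    "happy": "Speak in a happy, warm, bright voice.",
    "sad": "Speak in a sad, soft, slow, gentle voice.",
    "angry": "Speak in an angry, tense, forceful voice.",
    "fear": "Speak in a fearful, tense, trembling voice.",
    "disgust": "Speak in a disgusted, displeased, rejecting voice.",
}

# Assumed pairing of benchmark labels with the listed inline markers.
# No neutral marker is listed, so neutral is assumed to use none.
FISH_MARKERS = {
    "happy": "(joyful)",
    "sad": "(sad)",
    "angry": "(angry)",
    "fear": "(scared)",
    "disgust": "(disgusted)",
}


def build_request(model, text, emotion):
    """Return a plain dict describing the inputs for one generation job."""
    instruction = CONTROL_TEXT[emotion]
    if model == "fish_audio_s1_mini":
        marker = FISH_MARKERS.get(emotion, "")
        return {"text": f"{marker} {text}".strip()}
    if model == "voxcpm2":
        # Assumed wrapper format; the exact inline syntax is not given.
        return {"text": f"({instruction}) {text}"}
    if model == "indextts-2":
        # Assumes emo_text carries the same control text as the other models.
        return {"text": text, "emo_text": instruction}
    # Qwen3-TTS and CosyVoice: raw sentence plus a separate instruction.
    return {"text": text, "instruction": instruction}


print(build_request("fish_audio_s1_mini", "会議は午前十時に始まります。", "happy"))
```

Keeping the per-model formatting in one place makes it easier to check that every system really received the same sentence and target label.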

Metrics

  • SenseVoice emotion accuracy: primary automatic screen. SenseVoice predictions are mapped to the six benchmark labels; surprised and unknown count as non-matches.
  • emotion2vec anchor hit and margin: secondary diagnostic using human emotional-speech anchor centroids from CSEMOTIONS for Chinese and JVNV for Japanese.
  • CER: faster-whisper-large-v3 transcription against the original prompt text, used to verify that emotional expression did not break the spoken content.
  • NISQA-TTS: primary naturalness diagnostic for synthesized speech.
  • UTMOS: secondary quality diagnostic; useful as a warning signal, but harsher and more out-of-domain for Japanese/Chinese.
  • RTF: real-time factor for synthesis speed.
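
The SenseVoice scoring rule can be sketched as follows. The raw label strings in `LABEL_MAP` are assumptions about what a cleaned SenseVoice prediction looks like, and both function names are hypothetical; the rule that surprised and unknown never match comes from the metric definition above.

```python
# Assumed cleaned SenseVoice labels mapped to the six benchmark labels.
LABEL_MAP = {
    "neutral": "neutral",
    "happy": "happy",
    "sad": "sad",
    "angry": "angry",
    "fearful": "fear",
    "disgusted": "disgust",
}


def map_prediction(raw_label):
    """Map a raw prediction to a benchmark label, or None if unmapped."""
    return LABEL_MAP.get(raw_label.strip().lower())


def emotion_accuracy(rows):
    """rows: iterable of (target_label, raw_prediction) pairs."""
    rows = list(rows)
    if not rows:
        return 0.0
    hits = sum(1 for target, raw in rows if map_prediction(raw) == target)
    return hits / len(rows)


# Hypothetical rows for illustration, not benchmark data.
example = [
    ("happy", "HAPPY"),
    ("fear", "SAD"),
    ("angry", "SURPRISED"),  # surprised never matches a target
    ("neutral", "unknown"),  # unknown never matches a target
]
print(emotion_accuracy(example))
```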

Results

Resource Usage

Resource metrics come from metrics/generation_runs.csv for the 600 successful generated rows. They are operational diagnostics rather than strict hardware benchmarks: GPU, VRAM, wall time, and RTF are populated for all completed rows, while CPU is not captured for server-backed adapters that run outside the sampled process tree.

| Model | Median wall time | Median RTF | Median peak VRAM | GPU util | GPU power | CPU | Median peak RSS |
|---|---|---|---|---|---|---|---|
| cosyvoice_300m_instruct | 2.26 s | 0.85 | 3.96 GB | 30.3% avg / 39.0% peak | 145.0 W avg / 155.6 W peak | 127.8% peak; 100% coverage | 5.54 GB |
| qwen3_tts_customvoice_1_7b | 4.20 s | 1.58 | 8.13 GB | 22.9% avg / 25.0% peak | 126.3 W avg / 127.1 W peak | 138.1% peak; 100% coverage | 6.22 GB |
| fish_audio_s1_mini | 7.06 s | 3.47 | 13.05 GB | 25.3% avg / 69.0% peak | 150.4 W avg / 183.7 W peak | not captured; 0% coverage | 0.80 GB |
| indextts-2 | 26.39 s | 6.97 | 7.29 GB | 18.2% avg / 100.0% peak | 131.3 W avg / 199.6 W peak | not captured; 0% coverage | 7.69 GB |
| voxcpm2 | 28.44 s | 9.84 | 12.79 GB | 12.3% avg / 100.0% peak | 106.7 W avg / 191.5 W peak | not captured; 0% coverage | 10.65 GB |

CosyVoice is the fastest and lowest-VRAM model in this run, but it is not the strongest emotion-control candidate. Qwen3-TTS requires more VRAM than CosyVoice but remains much faster than IndexTTS-2 and VoxCPM2 while keeping the best balance of emotion recognition and text fidelity. Fish Audio has a small process RSS footprint, but its GPU memory footprint is the largest of the completed models.
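
For readers unfamiliar with RTF, the sketch below uses the conventional definition, synthesis wall time divided by the duration of the generated audio, so values below 1.0 mean faster than real time. That definition is an assumption here, since the exact computation is not spelled out, and the sample rows are invented for illustration.

```python
from statistics import median


def real_time_factor(wall_seconds, audio_seconds):
    """Conventional RTF: time spent synthesizing per second of audio."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return wall_seconds / audio_seconds


# Hypothetical (wall_seconds, audio_seconds) rows, not benchmark data.
rows = [(2.0, 2.5), (2.4, 3.0), (3.0, 2.5)]
rtfs = [real_time_factor(wall, audio) for wall, audio in rows]
median_rtf = median(rtfs)
print(round(median_rtf, 2))
```

Note that a median RTF is the median of per-sample ratios, which need not equal the median wall time divided by the median audio duration.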

JA/ZH Metrics Overview

This split table is the quickest way to compare Japanese and Chinese behavior across the three core automatic checks: SenseVoice emotion accuracy, CER text fidelity, and emotion2vec anchor alignment.

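| Model | JA SenseVoice | ZH SenseVoice | JA CER | ZH CER | JA anchor hit | ZH anchor hit | JA anchor margin | ZH anchor margin |
|---|---|---|---|---|---|---|---|---|
| qwen3_tts_customvoice_1_7b | 15.0% | 53.3% | 8.6% | 9.7% | 40.0% | 64.0% | -0.06645 | 0.04480 |
| indextts-2 | 43.3% | 16.7% | 91.0% | 10.3% | 38.0% | 30.0% | -0.08293 | -0.04063 |
| voxcpm2 | 6.7% | 35.0% | 18.6% | 4.4% | 40.0% | 36.0% | -0.04479 | -0.02693 |
| cosyvoice_300m_instruct | 1.7% | 36.7% | 43.9% | 11.1% | 24.0% | 72.0% | -0.05481 | 0.03796 |
| fish_audio_s1_mini | 6.7% | 16.7% | 12.7% | 16.8% | 20.0% | 24.0% | -0.08972 | -0.09542 |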

Chinese is generally easier for the automatic emotion metrics, but CER and emotion accuracy do not always move together. Qwen3-TTS keeps CER low in both languages, while IndexTTS-2 has the highest Japanese SenseVoice score and also the worst Japanese CER.
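
Pooled scores are referred to elsewhere in this article but are not listed in the split table. As a sketch, they can be approximated from the per-language SenseVoice columns, assuming each model's 120 samples split evenly into 60 Japanese and 60 Chinese clips. The results are approximations derived from rounded table values, not separately reported figures.

```python
# (JA SenseVoice %, ZH SenseVoice %) copied from the split table.
PER_LANGUAGE_ACCURACY = {
    "qwen3_tts_customvoice_1_7b": (15.0, 53.3),
    "indextts-2": (43.3, 16.7),
    "voxcpm2": (6.7, 35.0),
    "cosyvoice_300m_instruct": (1.7, 36.7),
    "fish_audio_s1_mini": (6.7, 16.7),
}
SAMPLES_PER_LANGUAGE = 60  # assumption: 120 samples per model, split evenly


def pooled_accuracy(ja_pct, zh_pct, n_ja=SAMPLES_PER_LANGUAGE, n_zh=SAMPLES_PER_LANGUAGE):
    """Sample-weighted average of the two per-language accuracies."""
    return (ja_pct * n_ja + zh_pct * n_zh) / (n_ja + n_zh)


pooled = {
    model: pooled_accuracy(ja, zh)
    for model, (ja, zh) in PER_LANGUAGE_ACCURACY.items()
}
for model, value in sorted(pooled.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{model}: {value:.1f}%")
```

Under this assumption a pooled score can hide a large language gap, which is why the split view is the more informative one.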

Text Fidelity (CER)

[Figure: CER by language]

For text fidelity, Qwen3-TTS is the most stable JA/ZH result: Japanese CER is 8.6% and Chinese CER is 9.7%. IndexTTS-2 is the warning case. Its pooled emotion score looks competitive, but its Japanese CER reaches 91.0%, so the generated Japanese text path is not reliable enough in this setup.
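
For reference, character error rate is the character-level edit distance between the prompt and the transcript, divided by the prompt length. The sketch below is a generic implementation, not the benchmark's own scoring code: the punctuation-stripping `normalize` step is an assumption, the ASR step is omitted, and the hypothesis string is invented.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings, one row at a time."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def normalize(text):
    """Assumed normalization: keep letters and digits, drop the rest."""
    return "".join(ch for ch in text if ch.isalnum())


def cer(reference, hypothesis):
    ref = normalize(reference)
    hyp = normalize(hypothesis)
    if not ref:
        raise ValueError("reference is empty after normalization")
    return edit_distance(ref, hyp) / len(ref)


reference = "会議は午前十時に始まります。"
hypothesis = "会議は午後十時に始まります"  # hypothetical transcript, one wrong character
print(round(cer(reference, hypothesis), 3))
```

One caveat for Japanese: if the recognizer writes a numeral or word in a different script than the prompt (for example 10時 for 十時), CER rises even when the speech is correct. Whether the benchmark normalizes such variants is not stated.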

Emotion Accuracy

SenseVoice

[Figure: SenseVoice accuracy by language]

Chinese is clearly easier than Japanese in this automatic setup. For Qwen3-TTS, Chinese SenseVoice accuracy is 53.3% while Japanese is 15.0%, even though CER is low in both languages. That suggests the issue is not just intelligibility; the emotional cues recognized by SenseVoice are much weaker or less aligned in Japanese.

[Figure: Per-emotion SenseVoice recall by model and language]

Fear and disgust are the hardest labels. SenseVoice recall is 0.0% for both emotions across all evaluated model/language pairs. Instead of being recognized, these targets usually collapse into sad, neutral, angry, or unknown.

Rows are target emotions and columns are SenseVoice predictions. Green boxes mark the ideal diagonal.

[Figure: Japanese SenseVoice confusion matrices]

[Figure: Chinese SenseVoice confusion matrices]
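
The row/column convention described above (rows are targets, columns are predictions) can be sketched with a small counter-based confusion matrix. The function names are hypothetical and the pairs are invented to mimic a collapse pattern; they are not rows from the benchmark.

```python
from collections import Counter, defaultdict


def confusion_counts(pairs):
    """Build {target: Counter({prediction: count})} from (target, prediction) pairs."""
    matrix = defaultdict(Counter)
    for target, predicted in pairs:
        matrix[target][predicted] += 1
    return matrix


def recall(matrix, emotion):
    """Share of clips with this target that were predicted as the same label."""
    row = matrix[emotion]
    total = sum(row.values())
    return row[emotion] / total if total else 0.0


# Hypothetical predictions for one model/language pair, 10 clips per target.
pairs = (
    [("fear", "sad")] * 9
    + [("fear", "unknown")]
    + [("angry", "angry")] * 7
    + [("angry", "neutral")] * 3
)
matrix = confusion_counts(pairs)
print(dict(matrix["fear"]), recall(matrix, "fear"), recall(matrix, "angry"))
```

Reading the off-diagonal cells in a row shows where a target emotion ends up when it is not recognized, which is what the failure-mode table summarizes.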

Compact failure-mode highlights:

| Case | What happened | Why it matters |
|---|---|---|
| indextts-2 / ja | happy -> sad 4/10; fear -> sad 5/10; disgust -> angry 10/10. | Emotion labels may look plausible even when Japanese text quality is unreliable. |
| qwen3_tts_customvoice_1_7b / zh | happy -> neutral 5/10; fear -> sad 9/10; disgust -> neutral 9/10. | Qwen is the balanced winner, but hard emotions still collapse. |
| cosyvoice_300m_instruct / ja | happy -> unknown 10/10; fear -> unknown 9/10; disgust -> unknown 8/10. | Naturalness does not guarantee recognizable emotional control. |
| fish_audio_s1_mini / zh | happy -> neutral 10/10; fear -> neutral 9/10; disgust -> neutral 8/10. | Inline emotion markers did not reliably shift the generated prosody. |
| voxcpm2 / zh | happy -> neutral 7/10; fear -> neutral 6/10; disgust -> neutral 10/10. | Prompt-driven control often collapsed into neutral speech. |

emotion2vec Anchors

[Figure: emotion2vec anchor hit and margin by language]

The anchor metric tells a similar story to SenseVoice: Chinese anchors are more favorable than Japanese anchors. A positive margin means the generated audio is closer to the target emotion centroid than to the nearest non-target centroid. Qwen3-TTS and CosyVoice are the only models with a positive Chinese margin (0.04480 and 0.03796), while every Japanese margin is negative.

Unlike SenseVoice, the anchor diagnostic is a centroid-similarity check rather than a label classifier, so the useful visual is the hit/margin split rather than a confusion matrix.
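
A minimal sketch of such a centroid check is shown below. It assumes cosine similarity between an utterance embedding and per-emotion centroids, and it treats a hit as a positive margin; both choices are assumptions, since the exact similarity function and hit rule are not given here. The two-dimensional vectors are toy values standing in for emotion2vec embeddings.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def anchor_margin(embedding, centroids, target):
    """Target similarity minus the best similarity to any other centroid."""
    target_sim = cosine(embedding, centroids[target])
    best_other = max(
        cosine(embedding, centroid)
        for label, centroid in centroids.items()
        if label != target
    )
    return target_sim - best_other


def anchor_hit(embedding, centroids, target):
    """Assumed hit rule: the target centroid is the closest one."""
    return anchor_margin(embedding, centroids, target) > 0


# Toy centroids and embedding, not real emotion2vec vectors.
centroids = {"happy": [1.0, 0.0], "sad": [0.0, 1.0]}
embedding = [0.9, 0.1]
print(anchor_margin(embedding, centroids, "happy") > 0)
print(anchor_hit(embedding, centroids, "sad"))
```

Because the margin only compares against centroids that exist, a missing anchor for one emotion changes which non-target centroid is nearest for every other emotion in that language.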

Naturalness

Naturalness diagnostics by model

| Model | Mean NISQA-TTS | Low NISQA-TTS (<3.0) | Mean UTMOS | Low UTMOS (<3.0) |
|---|---|---|---|---|
| cosyvoice_300m_instruct | 4.267 | 0.0% | 3.282 | 20.8% |
| indextts-2 | 4.063 | 11.7% | 2.078 | 93.3% |
| qwen3_tts_customvoice_1_7b | 4.007 | 0.8% | 2.939 | 51.7% |
| fish_audio_s1_mini | 3.935 | 3.3% | 2.932 | 55.8% |
| voxcpm2 | 3.788 | 8.3% | 2.596 | 76.7% |

Naturalness and emotional correctness are different questions. CosyVoice is the clearest naturalness winner, but it is not the emotion-control winner. Qwen3-TTS is slightly behind CosyVoice on NISQA-TTS, but substantially better on the balanced emotion/intelligibility trade-off.

Listening Examples

The table below uses the same prompt index for happy and angry in Japanese and Chinese. These clips are not a human listening test; they are qualitative anchors for the automatic metrics.

| Model | Language | Target | SenseVoice prediction |
|---|---|---|---|
| qwen3_tts_customvoice_1_7b | JA | happy | unknown |
| qwen3_tts_customvoice_1_7b | JA | angry | angry |
| qwen3_tts_customvoice_1_7b | ZH | happy | neutral |
| qwen3_tts_customvoice_1_7b | ZH | angry | angry |
| cosyvoice_300m_instruct | JA | happy | unknown |
| cosyvoice_300m_instruct | JA | angry | unknown |
| cosyvoice_300m_instruct | ZH | happy | happy |
| cosyvoice_300m_instruct | ZH | angry | neutral |
| indextts-2 | JA | happy | sad |
| indextts-2 | JA | angry | surprised |
| indextts-2 | ZH | happy | neutral |
| indextts-2 | ZH | angry | neutral |
| fish_audio_s1_mini | JA | happy | happy |
| fish_audio_s1_mini | JA | angry | happy |
| fish_audio_s1_mini | ZH | happy | neutral |
| fish_audio_s1_mini | ZH | angry | neutral |
| voxcpm2 | JA | happy | unknown |
| voxcpm2 | JA | angry | angry |
| voxcpm2 | ZH | happy | happy |
| voxcpm2 | ZH | angry | angry |

Limitations

  • Automatic emotion labels are not human judgment. SenseVoice is useful because it supports Japanese and Chinese and emits labels that map to the benchmark, but it can have classifier bias and language imbalance.
  • Anchor metrics depend on the anchor datasets. Japanese anchors come from JVNV and Chinese anchors from CSEMOTIONS; ja/neutral and zh/disgust anchors were missing in this run.
  • IndexTTS-2 Japanese is diagnostic, not production evidence. Its pooled emotion score looks strong, but Japanese CER is too high in this setup.

Further Research

  • Run a small native-listener MOS/CMOS test for Qwen3-TTS and CosyVoice, with separate ratings for naturalness, emotion correctness, and text intelligibility.
  • Treat IndexTTS-2 as Chinese-only for now, or rerun it after fixing the Japanese tokenizer/text path.
  • Add or curate missing ja/neutral and zh/disgust emotion anchors.
  • Run a focused Chinese human check for sad, angry, fear, and disgust, where automatic metrics show strong differences between easy and hard labels.
  • Keep SenseVoice as an automatic screening metric, but make final production decisions with human listening tests.

Conclusion

For Japanese and Chinese emotional TTS, Qwen3-TTS CustomVoice 1.7B is the strongest balanced model in this benchmark. It does not solve every emotion, but it offers the best practical mix of emotion recognition, low CER, anchor hit rate, naturalness, and runtime.

CosyVoice 300M Instruct is the naturalness leader and remains worth testing in human listening studies, but it should not be treated as having solved six-emotion control. IndexTTS-2 is diagnostically interesting, especially for Chinese, but the Japanese results should not be trusted until the text path is fixed.

The biggest open problem is not raw naturalness. It is reliable, language-consistent emotion control. Chinese is easier than Japanese in this setup, and fear and disgust remain open problems across the evaluated models.
