Japanese and Chinese Emotional TTS Benchmark


Summary

We benchmarked five emotional TTS systems for Japanese and Chinese on six target emotions: neutral, happy, sad, angry, fear, and disgust. The sentences were kept semantically neutral so that emotion has to come from speech style rather than from the text.

The most balanced candidate is Qwen3-TTS CustomVoice 1.7B. It combines low CER in both languages, the best overall anchor hit rate, strong naturalness, and the most practical emotion balance across Japanese and Chinese.

CosyVoice 300M Instruct leads on naturalness, but its emotion control is weak. IndexTTS-2 has a good-looking pooled SenseVoice score, but its Japanese CER is very high. Chinese is easier than Japanese, and fear and disgust remain unresolved.

Motivation

Emotional TTS is not only about producing a natural-sounding voice. A model has to say the right sentence, stay natural enough to listen to, and express the requested emotion. The benchmark therefore looks at emotion recognition, emotion anchors, CER, naturalness, runtime, and listening samples together.

The intended Japanese or Chinese sentence must be spoken correctly.
The speech must sound natural enough for real listening.
The generated voice must express the requested emotion rather than collapsing into neutral speech or a nearby emotion.

Evaluation methodology

The benchmark uses a balanced generation grid over language, emotion, and prompt text. Each sentence is reused for all six emotions, so a model can express emotion only through prosody and voice style.
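As a rough illustration of the balanced grid described above, the sketch below enumerates one job per model, language, prompt, and emotion. The model identifiers and emotion labels are taken from the tables in this report. `PROMPTS_PER_LANGUAGE = 10` is an assumption inferred from the 600 reported generations and the x/10 counts in the confusion examples, and `build_grid` is a hypothetical helper, not the benchmark's actual code.

```python
from itertools import product

MODELS = [
    "qwen3_tts_customvoice_1_7b",
    "cosyvoice_300m_instruct",
    "indextts-2",
    "fish_audio_s1_mini",
    "voxcpm2",
]
LANGUAGES = ["ja", "zh"]
EMOTIONS = ["neutral", "happy", "sad", "angry", "fear", "disgust"]
PROMPTS_PER_LANGUAGE = 10  # assumption: inferred from the 600 total generations


def build_grid():
    """Return one job dict per (model, language, prompt, emotion) cell."""
    jobs = []
    for model, lang, idx, emotion in product(
        MODELS, LANGUAGES, range(1, PROMPTS_PER_LANGUAGE + 1), EMOTIONS
    ):
        jobs.append(
            {
                "model": model,
                "language": lang,
                "prompt_id": f"{lang}_{idx:03d}",  # same ID style as ja_001, zh_001
                "emotion": emotion,
            }
        )
    return jobs


grid = build_grid()
print(len(grid))  # 5 models x 2 languages x 10 prompts x 6 emotions
```

Because every prompt appears once per emotion for every model, any difference between emotions has to come from the control signal rather than from the sentence content.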

Experiment design

Prompt set

Example Japanese prompts:

ID	Sentence
`ja_001`	会議は午前十時に始まります。
`ja_002`	資料は机の上に置いてあります。
`ja_003`	明日の予定を確認してください。
`ja_004`	電車は三番線から出発します。
`ja_005`	受付で名前を伝えてください。

Example Chinese prompts:

ID	Sentence
`zh_001`	会议将在上午十点开始。
`zh_002`	资料已经放在桌子上。
`zh_003`	请确认明天的日程安排。
`zh_004`	列车将从三号站台出发。
`zh_005`	请在前台告知您的姓名。

Emotion controls

Target emotion	Control text
`neutral`	Speak in a clear, neutral, natural voice.
`happy`	Speak in a happy, warm, bright voice.
`sad`	Speak in a sad, soft, slow, gentle voice.
`angry`	Speak in an angry, tense, forceful voice.
`fear`	Speak in a fearful, tense, trembling voice.
`disgust`	Speak in a disgusted, displeased, rejecting voice.

Metrics

SenseVoice emotion accuracy: the main automatic screening metric.
emotion2vec anchor hit and margin: a secondary diagnostic metric based on anchor centroids computed from emotional speech.
CER: character error rate of the transcription against the original prompt text.
NISQA-TTS and UTMOS: diagnostic metrics for the naturalness and quality of the synthesized speech.
RTF: real-time factor, measuring synthesis speed.
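To make two of these metrics concrete, here is a minimal sketch of CER and RTF using their standard definitions: CER as character-level edit distance divided by the reference length, and RTF as synthesis time divided by the duration of the generated audio. The report does not say which ASR system or text normalization was used, so this sketch compares raw strings. The function names are illustrative.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(
                min(
                    prev[j] + 1,         # deletion from the reference
                    curr[j - 1] + 1,     # insertion into the reference
                    prev[j - 1] + cost,  # substitution or match
                )
            )
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance normalized by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


def rtf(synthesis_seconds: float, audio_seconds: float) -> float:
    """Real-time factor: values below 1.0 mean faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


# One wrong character (前 -> 後) in a transcription of prompt ja_001.
reference = "会議は午前十時に始まります。"
hypothesis = "会議は午後十時に始まります。"
print(f"CER = {cer(reference, hypothesis):.3f}")
```

Character-level scoring suits Japanese and Chinese because neither language marks word boundaries with spaces. Note that CER can exceed 100% when the transcription contains many inserted characters.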

Results

Resource usage

Resource metrics come from 600 successful generations. GPU, VRAM, wall time, and RTF are present in every completed row; CPU usage was not always captured for server-backed adapters.

Model	Median wall time	Median RTF	Median peak VRAM	GPU util	GPU power	CPU	Median peak RSS
`cosyvoice_300m_instruct`	2.26s	0.85	3.96 GB	30.3% avg / 39.0% peak	145.0W avg / 155.6W peak	127.8% peak; 100% coverage	5.54 GB
`qwen3_tts_customvoice_1_7b`	4.20s	1.58	8.13 GB	22.9% avg / 25.0% peak	126.3W avg / 127.1W peak	138.1% peak; 100% coverage	6.22 GB
`fish_audio_s1_mini`	7.06s	3.47	13.05 GB	25.3% avg / 69.0% peak	150.4W avg / 183.7W peak	not captured; 0% coverage	0.80 GB
`indextts-2`	26.39s	6.97	7.29 GB	18.2% avg / 100.0% peak	131.3W avg / 199.6W peak	not captured; 0% coverage	7.69 GB
`voxcpm2`	28.44s	9.84	12.79 GB	12.3% avg / 100.0% peak	106.7W avg / 191.5W peak	not captured; 0% coverage	10.65 GB

CosyVoice was the fastest and lowest-VRAM model, but it was not the strongest emotion-control candidate. Qwen3-TTS uses more VRAM than CosyVoice, but it is much faster than IndexTTS-2 and VoxCPM2 and gives the best balance.

JA/ZH metrics overview

This split table shows the three core automatic checks for Japanese and Chinese: SenseVoice emotion accuracy, CER text fidelity, and emotion2vec anchor alignment.

Model	JA SenseVoice	ZH SenseVoice	JA CER	ZH CER	JA anchor hit	ZH anchor hit	JA anchor margin	ZH anchor margin
`qwen3_tts_customvoice_1_7b`	15.0%	53.3%	8.6%	9.7%	40.0%	64.0%	-0.06645	0.04480
`indextts-2`	43.3%	16.7%	91.0%	10.3%	38.0%	30.0%	-0.08293	-0.04063
`voxcpm2`	6.7%	35.0%	18.6%	4.4%	40.0%	36.0%	-0.04479	-0.02693
`cosyvoice_300m_instruct`	1.7%	36.7%	43.9%	11.1%	24.0%	72.0%	-0.05481	0.03796
`fish_audio_s1_mini`	6.7%	16.7%	12.7%	16.8%	20.0%	24.0%	-0.08972	-0.09542

Chinese is generally easier on the automatic emotion metrics, but CER and emotion accuracy do not always move together. Qwen3-TTS keeps CER low in both languages; IndexTTS-2 has the highest Japanese SenseVoice score but also the worst Japanese CER.

Text fidelity (CER)

CER by language

Qwen3-TTS is the most stable on text fidelity: 8.6% CER in Japanese and 9.7% in Chinese. IndexTTS-2 is the warning case, with Japanese CER reaching 91.0%.

Emotion accuracy

SenseVoice

SenseVoice accuracy by language

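In this automatic setup, Chinese is clearly easier than Japanese for four of the five models. For Qwen3-TTS, SenseVoice accuracy is 53.3% in Chinese versus 15.0% in Japanese, even though CER is low in both. IndexTTS-2 is the one exception (43.3% in Japanese versus 16.7% in Chinese), and its Japanese score sits alongside a 91.0% CER.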

Per-emotion SenseVoice recall by model and language

fear and disgust are the hardest labels. SenseVoice recall for both is 0.0% in every model/language pair, and the outputs mostly collapse into sad, neutral, angry, or unknown.

Rows are target emotions and columns are SenseVoice predictions. Green boxes mark the ideal diagonal.
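A confusion matrix laid out this way, and the per-emotion recall derived from it, can be computed from (target, predicted) label pairs as sketched below. The pairs in the example are made-up toy data, not benchmark results, and the function names are illustrative.

```python
from collections import Counter, defaultdict


def confusion_counts(pairs):
    """Build {target: Counter({predicted: count})} from (target, predicted) pairs."""
    matrix = defaultdict(Counter)
    for target, predicted in pairs:
        matrix[target][predicted] += 1
    return matrix


def per_emotion_recall(matrix):
    """Recall per target emotion: diagonal count divided by the row total."""
    recall = {}
    for target, row in matrix.items():
        total = sum(row.values())
        recall[target] = row[target] / total if total else 0.0
    return recall


# Toy data for illustration only.
pairs = (
    [("happy", "happy")] * 3
    + [("happy", "neutral")] * 1
    + [("fear", "sad")] * 2
)
matrix = confusion_counts(pairs)
for emotion, value in per_emotion_recall(matrix).items():
    print(f"{emotion}: {value:.2f}")
```

A target emotion that is never predicted correctly, as in the toy `fear` row, has zero on the diagonal and therefore a recall of 0.0, which is the pattern reported for fear and disgust.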

Japanese SenseVoice confusion matrices

Chinese SenseVoice confusion matrices

Case	What happened	Why it matters
`indextts-2 / ja`	`happy` -> `sad` 4/10; `fear` -> `sad` 5/10; `disgust` -> `angry` 10/10.	Emotion labels can look plausible even when Japanese text quality is unreliable.
`qwen3_tts_customvoice_1_7b / zh`	`happy` -> `neutral` 5/10; `fear` -> `sad` 9/10; `disgust` -> `neutral` 9/10.	Qwen is the balanced winner, but the hard emotions still collapse.
`cosyvoice_300m_instruct / ja`	`happy` -> `unknown` 10/10; `fear` -> `unknown` 9/10; `disgust` -> `unknown` 8/10.	Naturalness does not guarantee recognizable emotional control.
`fish_audio_s1_mini / zh`	`happy` -> `neutral` 10/10; `fear` -> `neutral` 9/10; `disgust` -> `neutral` 8/10.	Inline emotion markers did not reliably shift the generated prosody.
`voxcpm2 / zh`	`happy` -> `neutral` 7/10; `fear` -> `neutral` 6/10; `disgust` -> `neutral` 10/10.	Prompt-driven control often collapsed into neutral speech.

emotion2vec anchors

emotion2vec anchor hit and margin by language

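The anchor metric tells a similar story to SenseVoice: Chinese anchors are generally more favorable than Japanese anchors, with a higher mean margin in Chinese for four of the five models. A positive margin means the audio sits closer to the target emotion centroid. Qwen3-TTS and CosyVoice have positive Chinese margins (0.04480 and 0.03796); every Japanese margin is negative.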
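The report does not spell out the exact formula for anchor hit and margin, so the following is only one plausible reading, stated as an assumption. Similarity is cosine similarity between an utterance embedding and each emotion centroid, the margin is the target similarity minus the best competing similarity, and a hit means the margin is positive. The 2-D vectors are toy values, not emotion2vec outputs.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def anchor_margin(embedding, centroids, target):
    """Assumed definition: sim(target centroid) minus the best other centroid.

    Requires at least two centroids, one of which is the target.
    """
    sims = {emotion: cosine(embedding, c) for emotion, c in centroids.items()}
    best_other = max(s for emotion, s in sims.items() if emotion != target)
    return sims[target] - best_other


def anchor_hit(embedding, centroids, target):
    """Assumed definition: the target centroid is the nearest one."""
    return anchor_margin(embedding, centroids, target) > 0


# Toy centroids and embedding for illustration only.
centroids = {"happy": [1.0, 0.0], "sad": [0.0, 1.0]}
utterance = [0.9, 0.1]
print(anchor_hit(utterance, centroids, "happy"))
print(anchor_hit(utterance, centroids, "sad"))
```

Under this reading, a model can have a negative mean margin and still score some hits, because the mean is pulled down by utterances that land far from their target centroid.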

Naturalness

Naturalness diagnostics by model

Model	Mean NISQA-TTS	Low NISQA-TTS <3.0	Mean UTMOS	Low UTMOS <3.0
`cosyvoice_300m_instruct`	4.267	0.0%	3.282	20.8%
`indextts-2`	4.063	11.7%	2.078	93.3%
`qwen3_tts_customvoice_1_7b`	4.007	0.8%	2.939	51.7%
`fish_audio_s1_mini`	3.935	3.3%	2.932	55.8%
`voxcpm2`	3.788	8.3%	2.596	76.7%

Naturalness and emotion correctness are separate questions. CosyVoice wins on naturalness but not on emotion control. Qwen3-TTS trails slightly on NISQA-TTS (4.007 versus 4.267), but offers a better trade-off across emotion, text fidelity, and speed.
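The two columns reported per naturalness metric, the mean score and the share of clips below 3.0, can be derived from per-clip scores as in this small sketch. The function name and the example scores are illustrative; only the 3.0 threshold comes from the table header.

```python
from statistics import mean


def naturalness_summary(scores, low_threshold=3.0):
    """Return the mean score and the fraction of clips below the threshold."""
    if not scores:
        raise ValueError("scores must be non-empty")
    low = sum(1 for s in scores if s < low_threshold)
    return {"mean": mean(scores), "low_fraction": low / len(scores)}


# Toy per-clip scores for illustration only.
summary = naturalness_summary([4.0, 2.5, 3.5, 2.0])
print(summary)
```

Reporting the low-score fraction next to the mean helps separate a model that is uniformly mediocre from one that is usually good but sometimes fails badly.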

Listening examples

The table below uses the same prompt index for the Japanese and Chinese happy and angry samples. These clips are not a human listening test; they are qualitative anchors for interpreting the automatic metrics.

Model	Language	Target	SenseVoice prediction
`qwen3_tts_customvoice_1_7b`	JA	happy	unknown
`qwen3_tts_customvoice_1_7b`	JA	angry	angry
`qwen3_tts_customvoice_1_7b`	ZH	happy	neutral
`qwen3_tts_customvoice_1_7b`	ZH	angry	angry
`cosyvoice_300m_instruct`	JA	happy	unknown
`cosyvoice_300m_instruct`	JA	angry	unknown
`cosyvoice_300m_instruct`	ZH	happy	happy
`cosyvoice_300m_instruct`	ZH	angry	neutral
`indextts-2`	JA	happy	sad
`indextts-2`	JA	angry	surprised
`indextts-2`	ZH	happy	neutral
`indextts-2`	ZH	angry	neutral
`fish_audio_s1_mini`	JA	happy	happy
`fish_audio_s1_mini`	JA	angry	happy
`fish_audio_s1_mini`	ZH	happy	neutral
`fish_audio_s1_mini`	ZH	angry	neutral
`voxcpm2`	JA	happy	unknown
`voxcpm2`	JA	angry	angry
`voxcpm2`	ZH	happy	happy
`voxcpm2`	ZH	angry	angry

Limitations

Automatic emotion labels are not human judgments. SenseVoice is useful because it supports Japanese and Chinese, but classifier bias and language imbalance are possible.
Anchor metrics depend on the anchor datasets. Japanese anchors came from JVNV and Chinese anchors from CSEMOTIONS; ja/neutral and zh/disgust anchors were missing in this run.
IndexTTS-2 results for Japanese are diagnostic only. The pooled score looks strong, but Japanese CER is very high in this setup.

Further research

Run a native-listener MOS/CMOS test for Qwen3-TTS and CosyVoice.
Treat IndexTTS-2 as a Chinese-focused candidate for now, or rerun it after fixing the Japanese tokenizer/text path.
Add or curate ja/neutral and zh/disgust anchors.
Run a focused human check for Chinese sad, angry, fear, and disgust.
Keep SenseVoice as the automatic screening metric, but base production decisions on human listening tests.

Conclusion

For Japanese and Chinese emotional TTS, Qwen3-TTS CustomVoice 1.7B is the most balanced model in this benchmark. It does not solve every emotion, but it offers the most practical mix of emotion recognition, low CER, anchor hit rate, naturalness, and runtime.
