
A benchmark of five emotional TTS models for Japanese and Chinese: six emotions, SenseVoice, emotion2vec anchors, CER, naturalness, runtime, and listening examples.
Summary
We benchmarked five emotional TTS systems for Japanese and Chinese on six target emotions: neutral, happy, sad, angry, fear, and disgust. The sentences were kept neutral so that the emotion comes from the speech style rather than from the text.
The most balanced candidate is Qwen3-TTS CustomVoice 1.7B. It showed low CER, the best overall anchor hit rate, strong naturalness, and the most practical emotion balance across Japanese and Chinese.
CosyVoice 300M Instruct leads on naturalness, but its emotion control is weak. IndexTTS-2's pooled SenseVoice score looks good, but its Japanese CER is very high. Chinese is easier than Japanese, and fear and disgust remain unresolved.
Motivation
Emotional TTS is not only about producing a natural voice. The model has to say the right sentence, sound natural enough to listen to, and express the requested emotion. The benchmark therefore looks at emotion recognition, emotion anchors, CER, naturalness, runtime, and listening samples together.
- The intended Japanese or Chinese sentence must stay correct.
- The speech must sound natural enough for real listening.
- The generated voice must express the requested emotion and must not collapse into neutral speech or a nearby emotion.
Evaluation methodology
The benchmark uses a balanced generation grid over language, emotion, and prompt text. The same sentence is reused for all six emotions, so the model has to express emotion through prosody and voice style.
Prompt set
Example Japanese prompts:
| ID | Sentence |
|---|---|
| ja_001 | 会議は午前十時に始まります。 |
| ja_002 | 資料は机の上に置いてあります。 |
| ja_003 | 明日の予定を確認してください。 |
| ja_004 | 電車は三番線から出発します。 |
| ja_005 | 受付で名前を伝えてください。 |
Example Chinese prompts:
| ID | Sentence |
|---|---|
| zh_001 | 会议将在上午十点开始。 |
| zh_002 | 资料已经放在桌子上。 |
| zh_003 | 请确认明天的日程安排。 |
| zh_004 | 列车将从三号站台出发。 |
| zh_005 | 请在前台告知您的姓名。 |
Emotion controls
| Target emotion | Control text |
|---|---|
| neutral | Speak in a clear, neutral, natural voice. |
| happy | Speak in a happy, warm, bright voice. |
| sad | Speak in a sad, soft, slow, gentle voice. |
| angry | Speak in an angry, tense, forceful voice. |
| fear | Speak in a fearful, tense, trembling voice. |
| disgust | Speak in a disgusted, displeased, rejecting voice. |
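As a rough illustration of the balanced grid, the sketch below enumerates one generation job per model, language, emotion, and prompt. The model IDs and control texts are copied from the tables in this post. The figure of 10 prompts per language and the `build_grid` helper are assumptions made for the example; with them, the grid has 600 cells, which would match the 600 generations reported under Resource usage.

```python
from itertools import product

# Model identifiers as they appear in the result tables.
MODELS = [
    "qwen3_tts_customvoice_1_7b",
    "cosyvoice_300m_instruct",
    "indextts-2",
    "fish_audio_s1_mini",
    "voxcpm2",
]
LANGUAGES = ["ja", "zh"]

# Control texts copied from the emotion control table.
EMOTION_CONTROLS = {
    "neutral": "Speak in a clear, neutral, natural voice.",
    "happy": "Speak in a happy, warm, bright voice.",
    "sad": "Speak in a sad, soft, slow, gentle voice.",
    "angry": "Speak in an angry, tense, forceful voice.",
    "fear": "Speak in a fearful, tense, trembling voice.",
    "disgust": "Speak in a disgusted, displeased, rejecting voice.",
}

# Assumption: 10 prompts per language, with IDs following the ja_001 pattern.
PROMPTS_PER_LANGUAGE = 10


def build_grid():
    """Return one job dict per (model, language, emotion, prompt) cell."""
    jobs = []
    for model, lang, (emotion, control) in product(
        MODELS, LANGUAGES, EMOTION_CONTROLS.items()
    ):
        for i in range(1, PROMPTS_PER_LANGUAGE + 1):
            jobs.append(
                {
                    "model": model,
                    "language": lang,
                    "emotion": emotion,
                    "control_text": control,
                    "prompt_id": f"{lang}_{i:03d}",
                }
            )
    return jobs


grid = build_grid()
print(len(grid))  # 5 models x 2 languages x 6 emotions x 10 prompts
```

Because every prompt appears once under every emotion, any difference between emotions for a given sentence has to come from the control text rather than from the sentence itself.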
Metrics
- SenseVoice emotion accuracy: the main automatic screening metric.
- emotion2vec anchor hit and margin: a secondary diagnostic metric based on anchor centroids built from emotional speech.
- CER: character error rate of the transcription against the original prompt text.
- NISQA-TTS and UTMOS: diagnostic metrics for the naturalness and quality of the synthesized speech.
- RTF: real-time factor, used to measure synthesis speed.
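To make the CER entry concrete, here is a minimal sketch of character error rate as edit distance divided by reference length. It is not the benchmark's actual scoring code: the `normalize` step (dropping whitespace and punctuation) and the function names are assumptions for illustration, and the real pipeline may normalize text differently.

```python
import unicodedata


def normalize(text: str) -> str:
    """Drop whitespace and punctuation (an assumed normalization)."""
    return "".join(
        ch
        for ch in text
        if not ch.isspace() and not unicodedata.category(ch).startswith("P")
    )


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance over characters, computed row by row."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate of a transcription against the prompt text."""
    ref = normalize(reference)
    hyp = normalize(hypothesis)
    if not ref:
        raise ValueError("reference is empty after normalization")
    return edit_distance(ref, hyp) / len(ref)


# ja_001 with one substituted character in a made-up transcription.
print(cer("会議は午前十時に始まります。", "会議は午後十時に始まります。"))
```

In the example, one substitution over 13 reference characters gives a CER of about 7.7%. CER can exceed 100% when the transcription contains many inserted characters.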
Results
Resource usage
Resource metrics come from 600 successful generations. GPU, VRAM, wall time, and RTF are present in all completed rows; CPU usage was not always captured for server-backed adapters.
| Model | Median wall time | Median RTF | Median peak VRAM | GPU util | GPU power | CPU | Median peak RSS |
|---|---|---|---|---|---|---|---|
| cosyvoice_300m_instruct | 2.26s | 0.85 | 3.96 GB | 30.3% avg / 39.0% peak | 145.0W avg / 155.6W peak | 127.8% peak; 100% coverage | 5.54 GB |
| qwen3_tts_customvoice_1_7b | 4.20s | 1.58 | 8.13 GB | 22.9% avg / 25.0% peak | 126.3W avg / 127.1W peak | 138.1% peak; 100% coverage | 6.22 GB |
| fish_audio_s1_mini | 7.06s | 3.47 | 13.05 GB | 25.3% avg / 69.0% peak | 150.4W avg / 183.7W peak | not captured; 0% coverage | 0.80 GB |
| indextts-2 | 26.39s | 6.97 | 7.29 GB | 18.2% avg / 100.0% peak | 131.3W avg / 199.6W peak | not captured; 0% coverage | 7.69 GB |
| voxcpm2 | 28.44s | 9.84 | 12.79 GB | 12.3% avg / 100.0% peak | 106.7W avg / 191.5W peak | not captured; 0% coverage | 10.65 GB |
CosyVoice was the fastest and lowest-VRAM model, but it was not the strongest emotion-control candidate. Qwen3-TTS uses more VRAM than CosyVoice, but it is much faster than IndexTTS-2 and VoxCPM2 and gives the best balance.
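For readers less familiar with RTF, the sketch below uses the usual definition: synthesis wall time divided by the duration of the generated audio, so values below 1 mean faster than real time. The timing pairs are hypothetical and are not rows from this benchmark.

```python
from statistics import median


def real_time_factor(wall_seconds: float, audio_seconds: float) -> float:
    """Synthesis wall time divided by the duration of the generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return wall_seconds / audio_seconds


# Hypothetical (wall time, audio duration) pairs in seconds.
rows = [(2.0, 2.5), (2.4, 2.6), (3.0, 3.0)]
rtfs = [real_time_factor(wall, audio) for wall, audio in rows]
median_rtf = median(rtfs)
print(median_rtf)
```

The median RTF is taken over per-utterance ratios, so it does not have to equal the median wall time divided by a median audio duration.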
JA/ZH metrics overview
This split table shows the three core automatic checks for Japanese and Chinese: SenseVoice emotion accuracy, CER text fidelity, and emotion2vec anchor alignment.
| Model | JA SenseVoice | ZH SenseVoice | JA CER | ZH CER | JA anchor hit | ZH anchor hit | JA anchor margin | ZH anchor margin |
|---|---|---|---|---|---|---|---|---|
| qwen3_tts_customvoice_1_7b | 15.0% | 53.3% | 8.6% | 9.7% | 40.0% | 64.0% | -0.06645 | 0.04480 |
| indextts-2 | 43.3% | 16.7% | 91.0% | 10.3% | 38.0% | 30.0% | -0.08293 | -0.04063 |
| voxcpm2 | 6.7% | 35.0% | 18.6% | 4.4% | 40.0% | 36.0% | -0.04479 | -0.02693 |
| cosyvoice_300m_instruct | 1.7% | 36.7% | 43.9% | 11.1% | 24.0% | 72.0% | -0.05481 | 0.03796 |
| fish_audio_s1_mini | 6.7% | 16.7% | 12.7% | 16.8% | 20.0% | 24.0% | -0.08972 | -0.09542 |
Chinese is generally easier on the automatic emotion metrics, but CER and emotion accuracy do not always move together. Qwen3-TTS keeps CER low in both languages; IndexTTS-2 has the highest Japanese SenseVoice score, but also the worst Japanese CER.
Text fidelity (CER)
Qwen3-TTS is the most stable on text fidelity: Japanese CER is 8.6% and Chinese CER is 9.7%. IndexTTS-2 is the warning case, because its Japanese CER reaches 91.0%.
Emotion accuracy
SenseVoice
In this automatic setup, Chinese is clearly easier than Japanese. For Qwen3-TTS, SenseVoice accuracy is 53.3% on Chinese and 15.0% on Japanese, even though CER is low in both.
fear and disgust are the hardest labels. SenseVoice recall for both is 0.0% in every model/language pair, and the output usually collapses into sad, neutral, angry, or unknown.
In the confusion matrices, rows are target emotions and columns are SenseVoice predictions. Green boxes mark the ideal diagonal.
| Case | What happened | Why it matters |
|---|---|---|
| indextts-2 / ja | happy -> sad 4/10; fear -> sad 5/10; disgust -> angry 10/10. | Emotion labels can look plausible even when Japanese text quality is unreliable. |
| qwen3_tts_customvoice_1_7b / zh | happy -> neutral 5/10; fear -> sad 9/10; disgust -> neutral 9/10. | Qwen is the balanced winner, but the hard emotions still collapse. |
| cosyvoice_300m_instruct / ja | happy -> unknown 10/10; fear -> unknown 9/10; disgust -> unknown 8/10. | Naturalness does not guarantee recognizable emotional control. |
| fish_audio_s1_mini / zh | happy -> neutral 10/10; fear -> neutral 9/10; disgust -> neutral 8/10. | Inline emotion markers did not reliably shift the generated prosody. |
| voxcpm2 / zh | happy -> neutral 7/10; fear -> neutral 6/10; disgust -> neutral 10/10. | Prompt-driven control often collapsed into neutral speech. |
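The collapse patterns in this table are rows of a confusion matrix. As a minimal sketch, the helpers below (hypothetical names, not the benchmark's code) count predictions per target emotion and compute per-emotion recall. The example reuses the `fear -> sad 9/10` row for Qwen3-TTS on Chinese; the tenth prediction is not listed in the table, so `neutral` is only a placeholder.

```python
from collections import Counter, defaultdict


def confusion_counts(pairs):
    """Map each target emotion to a Counter of predicted labels."""
    matrix = defaultdict(Counter)
    for target, predicted in pairs:
        matrix[target][predicted] += 1
    return matrix


def recall(matrix, emotion):
    """Share of clips with this target that were predicted as that target."""
    row = matrix[emotion]
    total = sum(row.values())
    return row[emotion] / total if total else 0.0


def accuracy(matrix):
    """Diagonal count divided by the number of all clips."""
    total = sum(sum(row.values()) for row in matrix.values())
    correct = sum(row[target] for target, row in matrix.items())
    return correct / total if total else 0.0


# fear -> sad 9/10 comes from the table; the remaining clip is a placeholder.
pairs = [("fear", "sad")] * 9 + [("fear", "neutral")]
matrix = confusion_counts(pairs)
print(matrix["fear"].most_common(1))  # most frequent prediction for fear
print(recall(matrix, "fear"))
```

A recall of 0.0 for a target emotion means no clip landed on the diagonal, which is the situation reported for fear and disgust.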
emotion2vec anchors
The anchor metric tells a similar story to SenseVoice: Chinese anchors are more favorable than Japanese anchors. A positive margin means the audio is closer to the target emotion centroid. Qwen3-TTS and CosyVoice have positive Chinese margins, while all Japanese margins are negative.
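One way to read anchor hit and margin is sketched below. This is an assumed formulation, not necessarily the exact one used in the benchmark: each emotion centroid is the mean of its anchor embeddings, similarity is cosine, the margin is the similarity to the target centroid minus the best similarity to any other centroid, and a hit is a positive margin. The two-dimensional vectors are toy values, not real emotion2vec embeddings.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def centroid(vectors):
    """Element-wise mean of a list of anchor embeddings."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def anchor_hit_and_margin(embedding, centroids, target):
    """Return (hit, margin) for one clip against the emotion centroids.

    Assumed definition: margin = sim(target) - max sim(other emotions).
    Raises KeyError if the target emotion has no centroid.
    """
    sims = {emotion: cosine(embedding, c) for emotion, c in centroids.items()}
    best_other = max(s for emotion, s in sims.items() if emotion != target)
    margin = sims[target] - best_other
    return margin > 0, margin


# Toy anchors: two embeddings per emotion.
centroids = {
    "happy": centroid([[1.0, 0.0], [0.8, 0.2]]),
    "sad": centroid([[0.0, 1.0], [0.2, 0.8]]),
}
hit, margin = anchor_hit_and_margin([0.9, 0.1], centroids, target="happy")
print(hit, round(margin, 3))
```

Under this reading, a negative mean margin means the generated clips sit, on average, closer to some other emotion's centroid than to the requested one. A target emotion without anchors has no centroid, so its clips cannot be scored this way.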
Naturalness
| Model | Mean NISQA-TTS | Low NISQA-TTS <3.0 | Mean UTMOS | Low UTMOS <3.0 |
|---|---|---|---|---|
| cosyvoice_300m_instruct | 4.267 | 0.0% | 3.282 | 20.8% |
| indextts-2 | 4.063 | 11.7% | 2.078 | 93.3% |
| qwen3_tts_customvoice_1_7b | 4.007 | 0.8% | 2.939 | 51.7% |
| fish_audio_s1_mini | 3.935 | 3.3% | 2.932 | 55.8% |
| voxcpm2 | 3.788 | 8.3% | 2.596 | 76.7% |
Naturalness and emotion correctness are separate questions. CosyVoice wins on naturalness but not on emotion control. Qwen3-TTS is slightly behind on NISQA-TTS, but has the better emotion/text/speed trade-off.
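The two kinds of columns in the naturalness table, the mean score and the share of clips below 3.0, can be derived from per-clip scores as in the short sketch below. The scores are made up for illustration, and `summarize` is a hypothetical helper.

```python
from statistics import mean


def summarize(scores, threshold=3.0):
    """Mean score and share of clips scoring below the threshold."""
    low = sum(1 for s in scores if s < threshold)
    return {"mean": mean(scores), "low_share": low / len(scores)}


# Made-up per-clip predictions, not benchmark data.
scores = [4.2, 3.8, 2.9, 4.0]
summary = summarize(scores)
print(summary)
```

Reporting the low-score share next to the mean helps separate a model that is uniformly mediocre from one that is mostly good with occasional bad clips.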
Listening examples
The table below uses the same prompt index for the Japanese and Chinese happy and angry samples. These clips are not a human listening test; they are qualitative anchors for interpreting the automatic metrics.
| Model | Language | Target | SenseVoice prediction | Sample |
|---|---|---|---|---|
| qwen3_tts_customvoice_1_7b | JA | happy | unknown | |
| qwen3_tts_customvoice_1_7b | JA | angry | angry | |
| qwen3_tts_customvoice_1_7b | ZH | happy | neutral | |
| qwen3_tts_customvoice_1_7b | ZH | angry | angry | |
| cosyvoice_300m_instruct | JA | happy | unknown | |
| cosyvoice_300m_instruct | JA | angry | unknown | |
| cosyvoice_300m_instruct | ZH | happy | happy | |
| cosyvoice_300m_instruct | ZH | angry | neutral | |
| indextts-2 | JA | happy | sad | |
| indextts-2 | JA | angry | surprised | |
| indextts-2 | ZH | happy | neutral | |
| indextts-2 | ZH | angry | neutral | |
| fish_audio_s1_mini | JA | happy | happy | |
| fish_audio_s1_mini | JA | angry | happy | |
| fish_audio_s1_mini | ZH | happy | neutral | |
| fish_audio_s1_mini | ZH | angry | neutral | |
| voxcpm2 | JA | happy | unknown | |
| voxcpm2 | JA | angry | angry | |
| voxcpm2 | ZH | happy | happy | |
| voxcpm2 | ZH | angry | angry | |
Limitations
- Automatic emotion labels are not human judgment. SenseVoice is useful because it supports Japanese and Chinese, but classifier bias and language imbalance are possible.
- Anchor metrics depend on the anchor datasets. Japanese anchors came from JVNV and Chinese anchors from CSEMOTIONS; in this run, `ja/neutral` and `zh/disgust` were missing.
- The IndexTTS-2 Japanese results are diagnostic only. The pooled score looks strong, but Japanese CER is very high in this setup.
Further research
- Run a native-listener MOS/CMOS test for Qwen3-TTS and CosyVoice.
- Treat IndexTTS-2 as a Chinese-focused candidate for now, or rerun it after fixing the Japanese tokenizer/text path.
- Add or curate `ja/neutral` and `zh/disgust` anchors.
- Run a focused human check for Chinese `sad`, `angry`, `fear`, and `disgust`.
- Keep SenseVoice as the automatic screening metric, but base production decisions on human listening tests.
Conclusion
For Japanese and Chinese emotional TTS, Qwen3-TTS CustomVoice 1.7B is the most balanced model in this benchmark. It does not solve every emotion, but it offers the most practical mix of emotion recognition, low CER, anchor hit rate, naturalness, and runtime.