Japanese and Chinese Emotional TTS Benchmark


Summary

We benchmarked five emotional TTS systems for Japanese and Chinese on six target emotions: neutral, happy, sad, angry, fear, and disgust. The sentences were kept neutral so that the emotion would come from the speech style rather than from the text.

The most balanced candidate is Qwen3-TTS CustomVoice 1.7B. It showed low CER, a strong anchor hit rate, good naturalness, and the most practical emotion-recognition balance across both Japanese and Chinese.

CosyVoice 300M Instruct leads on naturalness, but its emotion control is weak. IndexTTS-2's pooled SenseVoice score looks good, but its Japanese CER is very high (91.0%), so it cannot be treated as reliable evidence of Japanese TTS quality. Overall, Chinese is easier than Japanese, and fear and disgust remain unresolved.

Motivation

Emotional TTS is not only about producing a natural voice. The model has to speak the correct sentence, stay natural enough to listen to, and express the requested emotion. This benchmark therefore looks at emotion recognition, emotion anchors, transcription correctness, naturalness, runtime, and listening samples together.

The target Japanese or Chinese sentence must stay correct.
The voice must sound natural enough for real listening.
The generated voice must express the requested emotion, and must not collapse into neutral speech or a nearby emotion.

Evaluation method

The benchmark is built on a balanced generation grid over language, emotion, and prompt text. The same sentence was reused for all six emotions, so that the model has to rely on prosody and voice style rather than on emotional wording.

Experiment design

Prompt set

Japanese prompt examples:

ID	Sentence
`ja_001`	会議は午前十時に始まります。
`ja_002`	資料は机の上に置いてあります。
`ja_003`	明日の予定を確認してください。
`ja_004`	電車は三番線から出発します。
`ja_005`	受付で名前を伝えてください。

Chinese prompt examples:

ID	Sentence
`zh_001`	会议将在上午十点开始。
`zh_002`	资料已经放在桌子上。
`zh_003`	请确认明天的日程安排。
`zh_004`	列车将从三号站台出发。
`zh_005`	请在前台告知您的姓名。

Emotion control

Target emotion	Control text
`neutral`	Speak in a clear, neutral, natural voice.
`happy`	Speak in a happy, warm, bright voice.
`sad`	Speak in a sad, soft, slow, gentle voice.
`angry`	Speak in an angry, tense, forceful voice.
`fear`	Speak in a fearful, tense, trembling voice.
`disgust`	Speak in a disgusted, displeased, rejecting voice.
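The balanced grid described above can be sketched in a few lines. This is only an illustration, not the benchmark's actual harness: the `build_grid` helper and the row field names are hypothetical, and the count of 10 prompts per language is an assumption that is consistent with the 600 generations reported later (5 models x 2 languages x 6 emotions x 10 prompts).

```python
from itertools import product

MODELS = [
    "cosyvoice_300m_instruct",
    "qwen3_tts_customvoice_1_7b",
    "fish_audio_s1_mini",
    "indextts-2",
    "voxcpm2",
]
LANGUAGES = ["ja", "zh"]

# Control text per target emotion, copied from the table above.
CONTROL_TEXT = {
    "neutral": "Speak in a clear, neutral, natural voice.",
    "happy": "Speak in a happy, warm, bright voice.",
    "sad": "Speak in a sad, soft, slow, gentle voice.",
    "angry": "Speak in an angry, tense, forceful voice.",
    "fear": "Speak in a fearful, tense, trembling voice.",
    "disgust": "Speak in a disgusted, displeased, rejecting voice.",
}

# Assumption: 10 prompts per language (only 5 examples are listed above).
PROMPTS_PER_LANGUAGE = 10


def build_grid():
    """Return one row per (model, language, prompt, emotion) combination."""
    rows = []
    prompt_indices = range(1, PROMPTS_PER_LANGUAGE + 1)
    for model, lang, idx, emotion in product(
        MODELS, LANGUAGES, prompt_indices, CONTROL_TEXT
    ):
        rows.append(
            {
                "model": model,
                "language": lang,
                "prompt_id": f"{lang}_{idx:03d}",
                "target_emotion": emotion,
                "control_text": CONTROL_TEXT[emotion],
            }
        )
    return rows


grid = build_grid()
print(len(grid))
```

Because every prompt ID appears once per emotion, any difference between two rows that share a model and prompt comes from the control text alone.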

Metrics

SenseVoice emotion accuracy: the main automatic screening metric.
emotion2vec anchor hit and margin: a secondary diagnostic metric based on emotional-speech anchor centroids.
CER: character error rate of the transcription against the original prompt text.
NISQA-TTS and UTMOS: diagnostic metrics for the naturalness and quality of the synthesized speech.
RTF: real-time factor, used to measure synthesis speed.
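For readers who want to reproduce the CER figure, here is a minimal sketch of character error rate as Levenshtein distance divided by reference length. The normalization step (dropping whitespace and punctuation before comparison) is an assumption; the benchmark does not state which normalization or which ASR system produced its transcripts.

```python
import unicodedata


def normalize(text):
    """Drop whitespace and punctuation (assumed normalization, not from the source)."""
    return "".join(
        ch
        for ch in text
        if not ch.isspace() and not unicodedata.category(ch).startswith("P")
    )


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences, using two rolling rows."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate of a transcript against the prompt text."""
    ref = normalize(reference)
    hyp = normalize(hypothesis)
    if not ref:
        raise ValueError("reference is empty after normalization")
    return edit_distance(ref, hyp) / len(ref)


# Prompt ja_001 against a hypothetical transcript with one wrong character.
score = cer("会議は午前十時に始まります。", "会議は午後十時に始まります")
print(f"{score:.3f}")
```

Note that CER can exceed 100% when the transcript contains many insertions, so a value such as 91.0% means the transcript shares very little with the prompt.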

Results

Resource usage

Resource metrics come from 600 successful generations. GPU, VRAM, wall time, and RTF are available for all completed rows; CPU usage was not always captured for server-backed adapters.

Model	Median wall time	Median RTF	Median peak VRAM	GPU util	GPU power	CPU	Median peak RSS
`cosyvoice_300m_instruct`	2.26s	0.85	3.96 GB	30.3% avg / 39.0% peak	145.0W avg / 155.6W peak	127.8% peak; 100% coverage	5.54 GB
`qwen3_tts_customvoice_1_7b`	4.20s	1.58	8.13 GB	22.9% avg / 25.0% peak	126.3W avg / 127.1W peak	138.1% peak; 100% coverage	6.22 GB
`fish_audio_s1_mini`	7.06s	3.47	13.05 GB	25.3% avg / 69.0% peak	150.4W avg / 183.7W peak	not captured; 0% coverage	0.80 GB
`indextts-2`	26.39s	6.97	7.29 GB	18.2% avg / 100.0% peak	131.3W avg / 199.6W peak	not captured; 0% coverage	7.69 GB
`voxcpm2`	28.44s	9.84	12.79 GB	12.3% avg / 100.0% peak	106.7W avg / 191.5W peak	not captured; 0% coverage	10.65 GB

CosyVoice was the fastest model and used the least VRAM, but it was not the strongest emotion-control candidate. Qwen3-TTS uses more VRAM than CosyVoice, but it is much faster than IndexTTS-2 and VoxCPM2 and gives the best balance of emotion recognition and text fidelity.
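As a reminder of how the RTF column is read, the sketch below uses the usual definition of real-time factor: synthesis wall time divided by the duration of the generated audio. The run values are made-up placeholders, not benchmark measurements, and the function name is hypothetical.

```python
from statistics import median


def real_time_factor(wall_seconds, audio_seconds):
    """RTF = synthesis wall time / duration of the generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return wall_seconds / audio_seconds


# Placeholder (wall time, audio duration) pairs in seconds, for illustration only.
runs = [(2.0, 4.0), (3.0, 3.0), (6.0, 2.0)]
rtfs = [real_time_factor(wall, audio) for wall, audio in runs]
median_rtf = median(rtfs)
print(rtfs, median_rtf)
```

An RTF below 1.0 means synthesis ran faster than real time; in the table above only CosyVoice (0.85) is under that line. The median of per-run RTFs is generally not the same as median wall time divided by median audio duration, so the two columns cannot be derived from each other.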

JA/ZH metrics overview

This split table shows the three core automatic checks separately for Japanese and Chinese: SenseVoice emotion accuracy, CER text fidelity, and emotion2vec anchor alignment.

Model	JA SenseVoice	ZH SenseVoice	JA CER	ZH CER	JA anchor hit	ZH anchor hit	JA anchor margin	ZH anchor margin
`qwen3_tts_customvoice_1_7b`	15.0%	53.3%	8.6%	9.7%	40.0%	64.0%	-0.06645	0.04480
`indextts-2`	43.3%	16.7%	91.0%	10.3%	38.0%	30.0%	-0.08293	-0.04063
`voxcpm2`	6.7%	35.0%	18.6%	4.4%	40.0%	36.0%	-0.04479	-0.02693
`cosyvoice_300m_instruct`	1.7%	36.7%	43.9%	11.1%	24.0%	72.0%	-0.05481	0.03796
`fish_audio_s1_mini`	6.7%	16.7%	12.7%	16.8%	20.0%	24.0%	-0.08972	-0.09542

Chinese is generally easier on the automatic emotion metrics, but CER and emotion accuracy do not always move together. Qwen3-TTS keeps CER low in both languages; IndexTTS-2 has the highest Japanese SenseVoice score but also the worst Japanese CER.

Text fidelity (CER)

CER by language

Qwen3-TTS is the most stable on text fidelity: 8.6% CER for Japanese and 9.7% for Chinese. IndexTTS-2 is the warning case, because its Japanese CER reaches 91.0%.

Emotion accuracy

SenseVoice

SenseVoice accuracy by language

In this automatic setup Chinese is clearly easier than Japanese. For Qwen3-TTS, SenseVoice accuracy is 53.3% in Chinese versus 15.0% in Japanese, even though CER is low in both. So the problem is not just intelligibility; it points either to weaker Japanese emotion cues or to how well SenseVoice aligns with Japanese emotional speech.

Per-emotion SenseVoice recall by model and language

fear and disgust are the hardest labels. Their SenseVoice recall is 0.0% for every model/language pair, and they often collapse into sad, neutral, angry, or unknown.

Rows are target emotions and columns are SenseVoice predictions. Green boxes mark the ideal diagonal.
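The row/column convention and the per-emotion recall can be made concrete with a small sketch. The `(target, predicted)` pairs below are toy data chosen for illustration, not benchmark rows, and the helper names are hypothetical.

```python
from collections import Counter, defaultdict


def confusion_matrix(pairs):
    """Map target emotion (row) -> Counter of predicted labels (columns)."""
    matrix = defaultdict(Counter)
    for target, predicted in pairs:
        matrix[target][predicted] += 1
    return matrix


def recall_per_emotion(matrix):
    """Recall = diagonal count / row total, for each target emotion."""
    return {
        target: row[target] / sum(row.values())
        for target, row in matrix.items()
    }


# Toy data: 10 clips per target emotion (illustrative only).
pairs = (
    [("angry", "angry")] * 7
    + [("angry", "neutral")] * 3
    + [("fear", "sad")] * 9
    + [("fear", "neutral")] * 1
)

matrix = confusion_matrix(pairs)
recall = recall_per_emotion(matrix)
print(dict(matrix["fear"]), recall)
```

A target whose row has no mass on the diagonal, like `fear` in the toy data, gets a recall of 0.0 regardless of which wrong label absorbed the predictions; the confusion matrix is what shows where they went.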

Japanese SenseVoice confusion matrices

Chinese SenseVoice confusion matrices

Case	What happened	Why it matters
`indextts-2 / ja`	`happy` -> `sad` 4/10; `fear` -> `sad` 5/10; `disgust` -> `angry` 10/10.	Emotion labels can look plausible even when Japanese text quality is unreliable.
`qwen3_tts_customvoice_1_7b / zh`	`happy` -> `neutral` 5/10; `fear` -> `sad` 9/10; `disgust` -> `neutral` 9/10.	Qwen is the balanced winner, but the hard emotions still collapse.
`cosyvoice_300m_instruct / ja`	`happy` -> `unknown` 10/10; `fear` -> `unknown` 9/10; `disgust` -> `unknown` 8/10.	Naturalness does not guarantee recognizable emotional control.
`fish_audio_s1_mini / zh`	`happy` -> `neutral` 10/10; `fear` -> `neutral` 9/10; `disgust` -> `neutral` 8/10.	Inline emotion markers did not reliably shift the generated prosody.
`voxcpm2 / zh`	`happy` -> `neutral` 7/10; `fear` -> `neutral` 6/10; `disgust` -> `neutral` 10/10.	Prompt-driven control often collapsed into neutral speech.

emotion2vec anchors

emotion2vec anchor hit and margin by language

The anchor metric tells a story similar to SenseVoice: Chinese anchors are more favorable than Japanese anchors. A positive margin means the generated audio is closer to the target-emotion centroid. Qwen3-TTS has a positive Chinese margin (0.04480), as does CosyVoice (0.03796), while all Japanese margins are negative.
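One plausible way to compute an anchor hit and margin is sketched below. The exact formula used in the benchmark is not given, so this is an assumption: the margin is taken as cosine similarity to the target centroid minus the best similarity to any other centroid, and a hit is a positive margin. The 2-D vectors stand in for real emotion2vec embeddings, and all function names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def centroid(vectors):
    """Element-wise mean of a list of anchor embeddings."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def anchor_hit_and_margin(embedding, centroids, target):
    """Return (hit, margin), or None when the target has no anchors.

    Assumed definition: margin = sim(target) - max sim(other emotions).
    """
    if target not in centroids:
        return None  # e.g. a language/emotion pair with missing anchors
    sims = {emotion: cosine(embedding, c) for emotion, c in centroids.items()}
    best_other = max(s for emotion, s in sims.items() if emotion != target)
    margin = sims[target] - best_other
    return margin > 0, margin


# Toy 2-D anchors standing in for real emotion embeddings.
centroids = {
    "happy": centroid([[1.0, 0.1], [1.0, -0.1]]),
    "sad": centroid([[0.1, 1.0], [-0.1, 1.0]]),
}
clip = [0.9, 0.2]
print(anchor_hit_and_margin(clip, centroids, "happy"))
print(anchor_hit_and_margin(clip, centroids, "sad"))
```

Under this definition a negative mean margin, as seen for every Japanese row, means the clips sit on average closer to some non-target centroid than to the requested one.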

Naturalness

Naturalness diagnostics by model

Model	Mean NISQA-TTS	Low NISQA-TTS <3.0	Mean UTMOS	Low UTMOS <3.0
`cosyvoice_300m_instruct`	4.267	0.0%	3.282	20.8%
`indextts-2`	4.063	11.7%	2.078	93.3%
`qwen3_tts_customvoice_1_7b`	4.007	0.8%	2.939	51.7%
`fish_audio_s1_mini`	3.935	3.3%	2.932	55.8%
`voxcpm2`	3.788	8.3%	2.596	76.7%

Naturalness and emotion correctness are separate questions. CosyVoice is the naturalness winner but not the emotion-control winner. Qwen3-TTS is slightly behind on NISQA-TTS, but it offers a better practical trade-off of emotion, text correctness, and speed.

Listening examples

The table below uses the same prompt index for happy and angry in Japanese and Chinese. These clips are not a human listening test; they are qualitative anchors for interpreting the automatic metrics.
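The two kinds of columns in the naturalness table, a mean score and the share of clips below 3.0, can be computed as in this short sketch. The scores are placeholders rather than benchmark outputs, the helper name is hypothetical, and the strict `<` comparison follows the "<3.0" column label.

```python
from statistics import mean


def low_share(scores, threshold=3.0):
    """Fraction of clips scoring strictly below the threshold."""
    return sum(1 for s in scores if s < threshold) / len(scores)


# Placeholder per-clip scores for one model (illustrative only).
scores = [2.5, 3.0, 3.5, 4.0]
print(mean(scores), low_share(scores))
```

Reporting both numbers is useful because a model can have an acceptable mean while still producing a sizable tail of low-scoring clips, which is the pattern the UTMOS columns show for several models.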

Model	Language	Target	SenseVoice prediction
`qwen3_tts_customvoice_1_7b`	JA	happy	unknown
`qwen3_tts_customvoice_1_7b`	JA	angry	angry
`qwen3_tts_customvoice_1_7b`	ZH	happy	neutral
`qwen3_tts_customvoice_1_7b`	ZH	angry	angry
`cosyvoice_300m_instruct`	JA	happy	unknown
`cosyvoice_300m_instruct`	JA	angry	unknown
`cosyvoice_300m_instruct`	ZH	happy	happy
`cosyvoice_300m_instruct`	ZH	angry	neutral
`indextts-2`	JA	happy	sad
`indextts-2`	JA	angry	surprised
`indextts-2`	ZH	happy	neutral
`indextts-2`	ZH	angry	neutral
`fish_audio_s1_mini`	JA	happy	happy
`fish_audio_s1_mini`	JA	angry	happy
`fish_audio_s1_mini`	ZH	happy	neutral
`fish_audio_s1_mini`	ZH	angry	neutral
`voxcpm2`	JA	happy	unknown
`voxcpm2`	JA	angry	angry
`voxcpm2`	ZH	happy	happy
`voxcpm2`	ZH	angry	angry

Limitations

Automatic emotion labels are not human judgments. SenseVoice supports Japanese and Chinese, but classifier bias and language imbalance are possible.
Anchor metrics depend on the anchor datasets. Japanese anchors came from JVNV and Chinese anchors from CSEMOTIONS; in this run the ja/neutral and zh/disgust anchors were missing.
The IndexTTS-2 Japanese result is diagnostic, not production evidence. The pooled emotion score looks strong, but Japanese CER is very high in this setup.

Future work

Run a native-listener MOS/CMOS test for Qwen3-TTS and CosyVoice in which naturalness, emotion correctness, and intelligibility are rated separately.
Treat IndexTTS-2 as Chinese-focused for now, or rerun it after fixing the Japanese tokenizer/text path.
Add or curate the ja/neutral and zh/disgust emotion anchors.
Run a focused human check for Chinese sad, angry, fear, and disgust.
Keep SenseVoice as the automatic screening metric, but base production decisions on human listening tests.

Conclusion

For Japanese and Chinese emotional TTS, Qwen3-TTS CustomVoice 1.7B is the most balanced model in this benchmark. It does not solve every emotion, but it gives the most practical mix of emotion recognition, low CER, anchor hit rate, naturalness, and runtime.
