Japanese and Chinese Emotional TTS Benchmark

Summary

We benchmarked five emotional TTS systems for Japanese and Chinese across six target emotions: neutral, happy, sad, angry, fear, and disgust. The sentences themselves are emotionally neutral, so the emotion has to come from the speech style.

The most balanced candidate is Qwen3-TTS CustomVoice 1.7B. It showed low CER, a good anchor hit rate, strong naturalness, and the most practical balance for Japanese and Chinese.

CosyVoice 300M Instruct leads on naturalness, but its emotion control is weak. IndexTTS-2 looks good on the pooled SenseVoice score, yet its Japanese CER is very high. Chinese is easier than Japanese, and fear and disgust remain unsolved.

Motivation

Emotional TTS is not only about producing a natural-sounding voice. The model has to speak the correct sentence, be pleasant enough to listen to, and express the requested emotion. This benchmark therefore looks at emotion recognition, anchors, CER, naturalness, runtime, and audio samples together.

The target Japanese or Chinese sentence must be preserved correctly.
The voice must be natural enough for real-world listening.
The generated voice must express the requested emotion and must not collapse into neutral speech or a nearby emotion.

Evaluation method

The benchmark uses a generation grid balanced across language, emotion, and prompt text. The same sentence is used for all six emotions, so the model has to express the emotion through prosody and voice style.

Experiment design

Prompt set

Japanese prompt examples:

ID	Sentence
`ja_001`	会議は午前十時に始まります。
`ja_002`	資料は机の上に置いてあります。
`ja_003`	明日の予定を確認してください。
`ja_004`	電車は三番線から出発します。
`ja_005`	受付で名前を伝えてください。

Chinese prompt examples:

ID	Sentence
`zh_001`	会议将在上午十点开始。
`zh_002`	资料已经放在桌子上。
`zh_003`	请确认明天的日程安排。
`zh_004`	列车将从三号站台出发。
`zh_005`	请在前台告知您的姓名。

Emotion control

Target emotion	Control text
`neutral`	Speak in a clear, neutral, natural voice.
`happy`	Speak in a happy, warm, bright voice.
`sad`	Speak in a sad, soft, slow, gentle voice.
`angry`	Speak in an angry, tense, forceful voice.
`fear`	Speak in a fearful, tense, trembling voice.
`disgust`	Speak in a disgusted, displeased, rejecting voice.
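
The balanced grid described in the evaluation method can be sketched as a simple Cartesian product. This is a minimal illustration, not the benchmark's actual harness: the model identifiers and control texts are copied from the tables in this article, while `PROMPTS_PER_LANGUAGE = 10`, the prompt IDs beyond `_005`, and the row layout are assumptions, inferred from the 10-sample cells in the confusion tables and the 600 reported generations.

```python
from itertools import product

# Model identifiers as they appear in the results tables.
MODELS = [
    "cosyvoice_300m_instruct",
    "qwen3_tts_customvoice_1_7b",
    "fish_audio_s1_mini",
    "indextts-2",
    "voxcpm2",
]
LANGUAGES = ["ja", "zh"]

# Control text per target emotion, copied from the table above.
CONTROL_TEXT = {
    "neutral": "Speak in a clear, neutral, natural voice.",
    "happy": "Speak in a happy, warm, bright voice.",
    "sad": "Speak in a sad, soft, slow, gentle voice.",
    "angry": "Speak in an angry, tense, forceful voice.",
    "fear": "Speak in a fearful, tense, trembling voice.",
    "disgust": "Speak in a disgusted, displeased, rejecting voice.",
}

# Assumption: 10 prompts per language (only five examples are listed above).
PROMPTS_PER_LANGUAGE = 10


def build_grid():
    """Return one row per (model, language, emotion, prompt) combination."""
    rows = []
    for model, lang, emotion in product(MODELS, LANGUAGES, CONTROL_TEXT):
        for i in range(1, PROMPTS_PER_LANGUAGE + 1):
            rows.append({
                "model": model,
                "language": lang,
                "emotion": emotion,
                "prompt_id": f"{lang}_{i:03d}",  # hypothetical ID scheme
                "control_text": CONTROL_TEXT[emotion],
            })
    return rows


grid = build_grid()
print(len(grid))  # 5 models x 2 languages x 6 emotions x 10 prompts = 600
```

Because every prompt appears once under every emotion, any difference between cells can be attributed to the control text rather than to the sentence content.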

Metrics

SenseVoice emotion accuracy: the main automatic screening metric.
emotion2vec anchor hit and margin: a secondary diagnostic metric based on emotional-speech anchor centroids.
CER: character error rate obtained by comparing the transcription with the original prompt text.
NISQA-TTS and UTMOS: diagnostic metrics for the naturalness and quality of the synthesized speech.
RTF: real-time factor, which measures synthesis speed.
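
As a rough illustration of the CER metric, the sketch below computes character-level edit distance divided by the reference length. It assumes the conventional CER definition and applies no text normalization; the ASR model used for transcription and any punctuation handling in the actual benchmark are not specified here, and the hypothesis string is a made-up transcription.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


reference = "会議は午前十時に始まります。"   # prompt ja_001
hypothesis = "会議は午後十時に始まります。"  # hypothetical ASR output, one wrong character
print(f"{cer(reference, hypothesis):.3f}")
```

Since insertions count as errors, CER is not capped at 100%, which is worth keeping in mind when reading very high values.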

Results

Resource usage

Resource metrics come from 600 successful generations. GPU, VRAM, wall time, and RTF are available for all completed rows; CPU usage is not always captured for server-backed adapters.

Model	Median wall time	Median RTF	Median peak VRAM	GPU util	GPU power	CPU	Median peak RSS
`cosyvoice_300m_instruct`	2.26s	0.85	3.96 GB	30.3% avg / 39.0% peak	145.0W avg / 155.6W peak	127.8% peak; 100% coverage	5.54 GB
`qwen3_tts_customvoice_1_7b`	4.20s	1.58	8.13 GB	22.9% avg / 25.0% peak	126.3W avg / 127.1W peak	138.1% peak; 100% coverage	6.22 GB
`fish_audio_s1_mini`	7.06s	3.47	13.05 GB	25.3% avg / 69.0% peak	150.4W avg / 183.7W peak	not captured; 0% coverage	0.80 GB
`indextts-2`	26.39s	6.97	7.29 GB	18.2% avg / 100.0% peak	131.3W avg / 199.6W peak	not captured; 0% coverage	7.69 GB
`voxcpm2`	28.44s	9.84	12.79 GB	12.3% avg / 100.0% peak	106.7W avg / 191.5W peak	not captured; 0% coverage	10.65 GB

CosyVoice is the fastest model and uses the least VRAM, but it is not the strongest on emotion control. Qwen3-TTS uses more VRAM than CosyVoice, but it is much faster than IndexTTS-2 and VoxCPM2 and offers a good balance.
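
For readers unfamiliar with RTF, here is a minimal sketch assuming the common definition: synthesis wall time divided by the duration of the generated audio, so values below 1.0 mean faster than real time. The per-row numbers are hypothetical and only show how a median RTF is derived; note that a median RTF is generally not equal to median wall time divided by median duration.

```python
from statistics import median


def real_time_factor(wall_time_s: float, audio_duration_s: float) -> float:
    """RTF = time spent synthesizing / length of the audio produced."""
    if audio_duration_s <= 0:
        raise ValueError("audio duration must be positive")
    return wall_time_s / audio_duration_s


# Hypothetical (wall time, audio duration) pairs in seconds for one model.
rows = [(2.0, 2.5), (2.4, 2.6), (2.1, 3.0)]

rtfs = [real_time_factor(wall, dur) for wall, dur in rows]
median_rtf = median(rtfs)
print(f"median RTF = {median_rtf:.2f}")
```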

JA/ZH metric overview

This table splits the three main automatic checks by Japanese and Chinese: SenseVoice emotion accuracy, CER, and emotion2vec anchor alignment.

Model	JA SenseVoice	ZH SenseVoice	JA CER	ZH CER	JA anchor hit	ZH anchor hit	JA anchor margin	ZH anchor margin
`qwen3_tts_customvoice_1_7b`	15.0%	53.3%	8.6%	9.7%	40.0%	64.0%	-0.06645	0.04480
`indextts-2`	43.3%	16.7%	91.0%	10.3%	38.0%	30.0%	-0.08293	-0.04063
`voxcpm2`	6.7%	35.0%	18.6%	4.4%	40.0%	36.0%	-0.04479	-0.02693
`cosyvoice_300m_instruct`	1.7%	36.7%	43.9%	11.1%	24.0%	72.0%	-0.05481	0.03796
`fish_audio_s1_mini`	6.7%	16.7%	12.7%	16.8%	20.0%	24.0%	-0.08972	-0.09542

On the automatic emotion metrics, Chinese is generally easier (IndexTTS-2 is the exception on SenseVoice), but CER and emotion accuracy do not always move in the same direction. Qwen3-TTS keeps CER low in both languages; IndexTTS-2 has the highest Japanese SenseVoice score but also the worst Japanese CER.
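
To see why a pooled score can hide a text-fidelity problem, the sketch below averages the per-language SenseVoice accuracies from the table and prints them next to Japanese CER. It assumes equal sample counts per language, so the pooled value is a plain mean; the benchmark's actual pooling may differ.

```python
# (JA SenseVoice %, ZH SenseVoice %, JA CER %) copied from the table above.
RESULTS = {
    "qwen3_tts_customvoice_1_7b": (15.0, 53.3, 8.6),
    "indextts-2": (43.3, 16.7, 91.0),
    "voxcpm2": (6.7, 35.0, 18.6),
    "cosyvoice_300m_instruct": (1.7, 36.7, 43.9),
    "fish_audio_s1_mini": (6.7, 16.7, 12.7),
}


def pooled_accuracy(ja: float, zh: float) -> float:
    """Simple mean, valid only if both languages have the same sample count."""
    return (ja + zh) / 2


pooled = {m: pooled_accuracy(ja, zh) for m, (ja, zh, _) in RESULTS.items()}
ranking = sorted(pooled, key=pooled.get, reverse=True)

for model in ranking:
    ja_cer = RESULTS[model][2]
    print(f"{model:28s} pooled={pooled[model]:5.1f}%  JA CER={ja_cer:5.1f}%")
```

Under this assumption IndexTTS-2 ranks second on pooled accuracy while having by far the worst Japanese CER, which is why the two metrics need to be read side by side.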

Text fidelity (CER)

CER by language

Qwen3-TTS is the most stable on text fidelity: Japanese CER is 8.6% and Chinese CER is 9.7%. IndexTTS-2 is the warning case, because its Japanese CER reaches 91.0%.

Emotion accuracy

SenseVoice

SenseVoice accuracy by language

In this automatic setup, Chinese is clearly easier than Japanese. For Qwen3-TTS, SenseVoice accuracy is 53.3% in Chinese and 15.0% in Japanese, even though CER is low in both.

Per-emotion SenseVoice recall by model and language

fear and disgust are the hardest labels. SenseVoice recall for both is 0.0% across all model/language pairs, and the samples are mostly predicted as sad, neutral, angry, or unknown.

Rows are target emotions and columns are SenseVoice predictions. Green boxes mark the ideal diagonal.

Japanese SenseVoice confusion matrices

Chinese SenseVoice confusion matrices

Case	What happened	Why it matters
`indextts-2 / ja`	`happy` -> `sad` 4/10; `fear` -> `sad` 5/10; `disgust` -> `angry` 10/10.	Emotion labels can look plausible even when Japanese text quality is unreliable.
`qwen3_tts_customvoice_1_7b / zh`	`happy` -> `neutral` 5/10; `fear` -> `sad` 9/10; `disgust` -> `neutral` 9/10.	Qwen is the most balanced candidate, but the hard emotions still collapse.
`cosyvoice_300m_instruct / ja`	`happy` -> `unknown` 10/10; `fear` -> `unknown` 9/10; `disgust` -> `unknown` 8/10.	Naturalness does not guarantee recognizable emotional control.
`fish_audio_s1_mini / zh`	`happy` -> `neutral` 10/10; `fear` -> `neutral` 9/10; `disgust` -> `neutral` 8/10.	Inline emotion markers did not reliably shift the generated prosody.
`voxcpm2 / zh`	`happy` -> `neutral` 7/10; `fear` -> `neutral` 6/10; `disgust` -> `neutral` 10/10.	Prompt-driven control often collapsed into neutral speech.
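
The confusion counts and per-emotion recall discussed in this section can be derived from (target, prediction) pairs as sketched below. The sample pairs are illustrative only: the `happy` -> `neutral` 5/10 and `fear` -> `sad` 9/10 counts echo one row of the table, while the remaining predictions are hypothetical fillers.

```python
from collections import Counter, defaultdict


def confusion_counts(pairs):
    """Map each target emotion to a Counter of predicted labels."""
    matrix = defaultdict(Counter)
    for target, predicted in pairs:
        matrix[target][predicted] += 1
    return matrix


def recall_per_emotion(matrix):
    """Recall = correct predictions / all samples with that target."""
    return {
        target: row[target] / sum(row.values())
        for target, row in matrix.items()
    }


# Illustrative pairs for one model/language cell (10 samples per target).
pairs = (
    [("happy", "neutral")] * 5 + [("happy", "happy")] * 5   # second half hypothetical
    + [("fear", "sad")] * 9 + [("fear", "neutral")] * 1     # last sample hypothetical
)

matrix = confusion_counts(pairs)
recall = recall_per_emotion(matrix)
print(dict(matrix["fear"]))
print(recall)
```

A target whose samples never land on the diagonal, like `fear` here, gets a recall of 0.0 regardless of how the errors are distributed.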

emotion2vec anchors

emotion2vec anchor hit and margin by language

The anchor metric tells a story similar to SenseVoice: Chinese anchors are generally more favorable than Japanese anchors. A positive margin means the generated audio is close to the target emotion centroid. Qwen3-TTS and CosyVoice have positive Chinese margins, while all Japanese margins are negative.
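
One plausible way to compute an anchor hit and margin is sketched below. The exact definition used in this benchmark is not given, so this assumes cosine similarity against per-emotion centroids, with the margin defined as similarity to the target centroid minus the best similarity to any other centroid, and a hit whenever that margin is positive. The 2-D vectors are toy values standing in for emotion2vec embeddings.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def centroid(vectors):
    """Element-wise mean of a list of vectors (the anchor for one emotion)."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def anchor_hit_and_margin(embedding, centroids, target):
    """Assumed definition: margin = sim(target) - max sim(other emotions)."""
    sims = {emotion: cosine(embedding, c) for emotion, c in centroids.items()}
    best_other = max(s for emotion, s in sims.items() if emotion != target)
    margin = sims[target] - best_other
    return margin > 0, margin


# Toy anchors built from hypothetical reference embeddings.
centroids = {
    "happy": centroid([[1.0, 0.1], [1.0, -0.1]]),  # -> [1.0, 0.0]
    "sad": centroid([[0.1, 1.0], [-0.1, 1.0]]),    # -> [0.0, 1.0]
}
generated = [0.8, 0.6]  # hypothetical embedding of a synthesized clip

print(anchor_hit_and_margin(generated, centroids, target="happy"))
print(anchor_hit_and_margin(generated, centroids, target="sad"))
```

With this definition a negative mean margin means the clips sit, on average, closer to some other emotion's anchor than to the requested one.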

Naturalness

Naturalness diagnostics by model

Model	Mean NISQA-TTS	Low NISQA-TTS <3.0	Mean UTMOS	Low UTMOS <3.0
`cosyvoice_300m_instruct`	4.267	0.0%	3.282	20.8%
`indextts-2`	4.063	11.7%	2.078	93.3%
`qwen3_tts_customvoice_1_7b`	4.007	0.8%	2.939	51.7%
`fish_audio_s1_mini`	3.935	3.3%	2.932	55.8%
`voxcpm2`	3.788	8.3%	2.596	76.7%

Naturalness and emotion correctness are separate questions. CosyVoice wins on naturalness but not on emotion control. Qwen3-TTS trails slightly on NISQA-TTS but offers a better emotion/text/speed trade-off.
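
The two kinds of columns in the naturalness table, a mean score and a share of clips below 3.0, can be computed as in this small sketch. The scores are hypothetical, and whether the benchmark uses a strict `< 3.0` comparison is assumed from the column headers.

```python
def summarize_scores(scores, threshold=3.0):
    """Return (mean score, fraction of clips scoring below the threshold)."""
    if not scores:
        raise ValueError("scores must not be empty")
    mean = sum(scores) / len(scores)
    low_rate = sum(1 for s in scores if s < threshold) / len(scores)
    return mean, low_rate


# Hypothetical per-clip UTMOS-style scores for one model.
scores = [4.2, 3.9, 2.8, 4.1]
mean, low_rate = summarize_scores(scores)
print(f"mean={mean:.3f}, low<3.0={low_rate:.1%}")
```

Reporting both numbers matters because a model can have a reasonable mean while still producing a large share of low-scoring clips.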

Listening examples

The table below uses the same prompt index for the Japanese and Chinese happy and angry samples. These clips are not a human listening test; they are qualitative anchors for interpreting the automatic metrics.

Model	Language	Target	SenseVoice prediction
`qwen3_tts_customvoice_1_7b`	JA	happy	unknown
`qwen3_tts_customvoice_1_7b`	JA	angry	angry
`qwen3_tts_customvoice_1_7b`	ZH	happy	neutral
`qwen3_tts_customvoice_1_7b`	ZH	angry	angry
`cosyvoice_300m_instruct`	JA	happy	unknown
`cosyvoice_300m_instruct`	JA	angry	unknown
`cosyvoice_300m_instruct`	ZH	happy	happy
`cosyvoice_300m_instruct`	ZH	angry	neutral
`indextts-2`	JA	happy	sad
`indextts-2`	JA	angry	surprised
`indextts-2`	ZH	happy	neutral
`indextts-2`	ZH	angry	neutral
`fish_audio_s1_mini`	JA	happy	happy
`fish_audio_s1_mini`	JA	angry	happy
`fish_audio_s1_mini`	ZH	happy	neutral
`fish_audio_s1_mini`	ZH	angry	neutral
`voxcpm2`	JA	happy	unknown
`voxcpm2`	JA	angry	angry
`voxcpm2`	ZH	happy	happy
`voxcpm2`	ZH	angry	angry

Limitations

Automatic emotion labels are not human judgments. SenseVoice is useful, but it may carry classifier bias and language imbalance.
Anchor metrics depend on the anchor datasets. Japanese anchors come from JVNV and Chinese anchors from CSEMOTIONS; ja/neutral and zh/disgust were missing.
The IndexTTS-2 Japanese results are diagnostic only. The pooled score looks good, but Japanese CER is very high in this setup.

Further work

Run a native-listener MOS/CMOS test for Qwen3-TTS and CosyVoice.
Treat IndexTTS-2 as a Chinese-focused candidate for now, or rerun it after fixing the Japanese tokenizer/text path.
Add or curate ja/neutral and zh/disgust anchors.
Run a focused human check on Chinese sad, angry, fear, and disgust.
Keep SenseVoice as the automatic screening metric, but base production decisions on human listening tests.

Conclusion

For Japanese and Chinese emotional TTS, Qwen3-TTS CustomVoice 1.7B is the most balanced model in this benchmark. It does not solve every emotion, but it offers the most practical mix of emotion recognition, low CER, anchor hit rate, naturalness, and runtime.