Japanese and Chinese Emotional TTS Benchmark | VoicePing

Emotional TTS Benchmark: Qwen3-TTS, CosyVoice, IndexTTS-2, Fish Audio, and VoxCPM for Japanese and Chinese

VoicePing Research

A benchmark of five emotional TTS models for Japanese and Chinese: six emotions, SenseVoice, emotion2vec anchors, CER, naturalness, runtime, and listening examples.

Summary

We benchmarked five emotional TTS systems for Japanese and Chinese on six target emotions: neutral, happy, sad, angry, fear, and disgust. The sentences were kept neutral so that emotion comes from the speech style rather than from the text.

The most balanced candidate is Qwen3-TTS CustomVoice 1.7B. It showed low CER, the best overall anchor hit rate, strong naturalness, and the most practical emotion balance for Japanese and Chinese.

CosyVoice 300M Instruct leads on naturalness, but its emotion control is weak. IndexTTS-2's pooled SenseVoice score looks good, but its Japanese CER is very high. Chinese is easier than Japanese, and fear and disgust remain unresolved.

Motivation

Emotional TTS is not only about producing a natural voice. The model has to say the right sentence, sound natural enough to listen to, and express the requested emotion. The benchmark therefore looks at emotion recognition, emotion anchors, CER, naturalness, runtime, and listening samples together.

  • The intended Japanese or Chinese sentence must stay correct.
  • The speech must sound natural enough for real listening.
  • The generated voice must express the requested emotion and must not collapse into neutral speech or a nearby emotion.

Evaluation methodology

The benchmark uses a balanced generation grid over language, emotion, and prompt text. The same sentence is reused for all six emotions, so the model has to express emotion through prosody and voice style.

Experiment design

Prompt set

Example Japanese prompts:

| ID | Sentence |
| --- | --- |
| ja_001 | 会議は午前十時に始まります。 |
| ja_002 | 資料は机の上に置いてあります。 |
| ja_003 | 明日の予定を確認してください。 |
| ja_004 | 電車は三番線から出発します。 |
| ja_005 | 受付で名前を伝えてください。 |

Example Chinese prompts:

| ID | Sentence |
| --- | --- |
| zh_001 | 会议将在上午十点开始。 |
| zh_002 | 资料已经放在桌子上。 |
| zh_003 | 请确认明天的日程安排。 |
| zh_004 | 列车将从三号站台出发。 |
| zh_005 | 请在前台告知您的姓名。 |

Emotion controls

| Target emotion | Control text |
| --- | --- |
| neutral | Speak in a clear, neutral, natural voice. |
| happy | Speak in a happy, warm, bright voice. |
| sad | Speak in a sad, soft, slow, gentle voice. |
| angry | Speak in an angry, tense, forceful voice. |
| fear | Speak in a fearful, tense, trembling voice. |
| disgust | Speak in a disgusted, displeased, rejecting voice. |
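
The balanced generation grid can be sketched as a Cartesian product over models, languages, emotions, and prompts. This is a minimal illustration, not the benchmark's actual harness. The model identifiers are the ones used in the result tables and the control texts are copied from the emotion-control table. `PROMPTS_PER_LANGUAGE = 10` is an assumption inferred from the 600 reported generations and the x/10 counts in the confusion cases; `build_grid`, the record field names, and prompt IDs beyond the listed examples are hypothetical.

```python
from itertools import product

MODELS = [
    "qwen3_tts_customvoice_1_7b",
    "cosyvoice_300m_instruct",
    "indextts-2",
    "fish_audio_s1_mini",
    "voxcpm2",
]
LANGUAGES = ["ja", "zh"]
EMOTION_CONTROLS = {
    "neutral": "Speak in a clear, neutral, natural voice.",
    "happy": "Speak in a happy, warm, bright voice.",
    "sad": "Speak in a sad, soft, slow, gentle voice.",
    "angry": "Speak in an angry, tense, forceful voice.",
    "fear": "Speak in a fearful, tense, trembling voice.",
    "disgust": "Speak in a disgusted, displeased, rejecting voice.",
}
# Assumption: 10 prompts per language (600 / 5 models / 2 languages / 6 emotions).
PROMPTS_PER_LANGUAGE = 10


def build_grid():
    """Return one generation job per (model, language, emotion, prompt)."""
    grid = []
    prompt_numbers = range(1, PROMPTS_PER_LANGUAGE + 1)
    for model, lang, emotion, n in product(
        MODELS, LANGUAGES, EMOTION_CONTROLS, prompt_numbers
    ):
        grid.append(
            {
                "model": model,
                "language": lang,
                "emotion": emotion,
                # Hypothetical ID scheme that follows the ja_001 / zh_001 pattern.
                "prompt_id": f"{lang}_{n:03d}",
                "control_text": EMOTION_CONTROLS[emotion],
            }
        )
    return grid


grid = build_grid()
print(len(grid))  # 600
```

Because every prompt appears once per emotion, any difference between two clips of the same sentence can only come from the control text, not from the wording.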

Metrics

  • SenseVoice emotion accuracy: the main automatic screening metric.
  • emotion2vec anchor hit and margin: a secondary diagnostic metric based on emotional-speech anchor centroids.
  • CER: the transcription error rate measured against the original prompt text.
  • NISQA-TTS and UTMOS: diagnostic metrics for the naturalness and quality of the synthesized speech.
  • RTF: real-time factor, used to measure synthesis speed.
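
To make the CER metric concrete, here is a minimal sketch using the standard definition: character-level edit distance divided by the reference length. The normalization step (dropping punctuation and whitespace) is an assumption, since the article does not describe how transcripts were normalized, and the function names are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def normalize(text):
    """Assumed normalization: keep letters and digits, drop punctuation and spaces."""
    return "".join(ch for ch in text if ch.isalnum())


def cer(reference, hypothesis):
    """Character error rate of a transcript against the prompt text."""
    ref = normalize(reference)
    hyp = normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref, hyp) / len(ref)


# ja_001 with one substituted character in a made-up transcript (前 -> 後).
score = cer("会議は午前十時に始まります。", "会議は午後十時に始まります。")
print(f"{score:.3f}")
```

CER can exceed 100% when the transcript contains many inserted characters, so a very high value usually signals that the spoken text diverged from the prompt rather than a few mispronunciations.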

Results

Resource usage

Resource metrics come from 600 successful generations. GPU, VRAM, wall time, and RTF are present in all completed rows; CPU usage was not always captured for server-backed adapters.

| Model | Median wall time | Median RTF | Median peak VRAM | GPU util | GPU power | CPU | Median peak RSS |
| --- | --- | --- | --- | --- | --- | --- | --- |
| cosyvoice_300m_instruct | 2.26s | 0.85 | 3.96 GB | 30.3% avg / 39.0% peak | 145.0W avg / 155.6W peak | 127.8% peak; 100% coverage | 5.54 GB |
| qwen3_tts_customvoice_1_7b | 4.20s | 1.58 | 8.13 GB | 22.9% avg / 25.0% peak | 126.3W avg / 127.1W peak | 138.1% peak; 100% coverage | 6.22 GB |
| fish_audio_s1_mini | 7.06s | 3.47 | 13.05 GB | 25.3% avg / 69.0% peak | 150.4W avg / 183.7W peak | not captured; 0% coverage | 0.80 GB |
| indextts-2 | 26.39s | 6.97 | 7.29 GB | 18.2% avg / 100.0% peak | 131.3W avg / 199.6W peak | not captured; 0% coverage | 7.69 GB |
| voxcpm2 | 28.44s | 9.84 | 12.79 GB | 12.3% avg / 100.0% peak | 106.7W avg / 191.5W peak | not captured; 0% coverage | 10.65 GB |

CosyVoice was the fastest and lowest-VRAM model, but not the strongest emotion-control candidate. Qwen3-TTS uses more VRAM than CosyVoice, but it is much faster than IndexTTS-2 and VoxCPM2 and offers the best balance.
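
For reference, the real-time factor is conventionally the synthesis wall time divided by the duration of the generated audio, so a value below 1 means faster than real time. The sketch below assumes that conventional definition (the article does not spell out its formula) and uses made-up timings; `real_time_factor` is a hypothetical helper.

```python
from statistics import median


def real_time_factor(wall_time_s, audio_duration_s):
    """RTF = time spent synthesizing / length of the audio produced."""
    if audio_duration_s <= 0:
        raise ValueError("audio duration must be positive")
    return wall_time_s / audio_duration_s


# Hypothetical (wall time, audio duration) pairs in seconds for one model.
runs = [(2.0, 2.5), (2.4, 2.4), (3.0, 2.0)]
rtfs = [real_time_factor(wall, audio) for wall, audio in runs]
median_rtf = median(rtfs)
print(median_rtf)
```

Under this definition, only a median RTF below 1.0 indicates a model that generates speech faster than it takes to play it back.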

JA/ZH metrics overview

This split table shows the three core automatic checks for Japanese and Chinese: SenseVoice emotion accuracy, CER text fidelity, and emotion2vec anchor alignment.

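| Model | JA SenseVoice | ZH SenseVoice | JA CER | ZH CER | JA anchor hit | ZH anchor hit | JA anchor margin | ZH anchor margin |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3_tts_customvoice_1_7b | 15.0% | 53.3% | 8.6% | 9.7% | 40.0% | 64.0% | -0.06645 | 0.04480 |
| indextts-2 | 43.3% | 16.7% | 91.0% | 10.3% | 38.0% | 30.0% | -0.08293 | -0.04063 |
| voxcpm2 | 6.7% | 35.0% | 18.6% | 4.4% | 40.0% | 36.0% | -0.04479 | -0.02693 |
| cosyvoice_300m_instruct | 1.7% | 36.7% | 43.9% | 11.1% | 24.0% | 72.0% | -0.05481 | 0.03796 |
| fish_audio_s1_mini | 6.7% | 16.7% | 12.7% | 16.8% | 20.0% | 24.0% | -0.08972 | -0.09542 |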

Chinese is generally easier on the automatic emotion metrics, but CER and emotion accuracy do not always move together. Qwen3-TTS keeps CER low in both languages; IndexTTS-2 has the highest Japanese SenseVoice score but also the worst Japanese CER.

Text fidelity (CER)

CER by language

Qwen3-TTS is the most stable on text fidelity: Japanese CER is 8.6% and Chinese CER is 9.7%. IndexTTS-2 is the warning case, because its Japanese CER reaches 91.0%.

Emotion accuracy

SenseVoice

SenseVoice accuracy by language

In this automatic setup, Chinese is clearly easier than Japanese. For Qwen3-TTS, SenseVoice accuracy is 53.3% for Chinese and 15.0% for Japanese, while CER is low in both.

Per-emotion SenseVoice recall by model and language

Fear and disgust are the hardest labels. SenseVoice recall for both is 0.0% in every model/language pair, and the outputs usually collapse into sad, neutral, angry, or unknown.

Rows are target emotions and columns are SenseVoice predictions. Green boxes mark the ideal diagonal.
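
The confusion matrices, per-emotion recall, and overall accuracy all follow from pairs of (target emotion, predicted label). The sketch below shows the bookkeeping only; it does not call SenseVoice, the function names are hypothetical, and the labels are a made-up toy sample rather than benchmark data.

```python
from collections import Counter, defaultdict


def confusion_matrix(targets, predictions):
    """Rows are target emotions, columns are predicted labels."""
    matrix = defaultdict(Counter)
    for target, predicted in zip(targets, predictions):
        matrix[target][predicted] += 1
    return matrix


def recall_per_emotion(matrix):
    """Share of each target emotion's clips that landed on the diagonal."""
    return {t: row[t] / sum(row.values()) for t, row in matrix.items()}


def accuracy(matrix):
    """Diagonal count divided by the total number of clips."""
    total = sum(sum(row.values()) for row in matrix.values())
    correct = sum(row[t] for t, row in matrix.items())
    return correct / total


# Toy sample: 4 happy clips and 4 fear clips with invented predictions.
targets = ["happy"] * 4 + ["fear"] * 4
predictions = ["happy", "happy", "neutral", "neutral", "sad", "sad", "sad", "unknown"]

matrix = confusion_matrix(targets, predictions)
recall = recall_per_emotion(matrix)
print(recall)            # {'happy': 0.5, 'fear': 0.0}
print(accuracy(matrix))  # 0.25
```

A target emotion with 0.0 recall, like fear in this toy sample, still shows where its clips went: the off-diagonal counts in its row identify the labels it collapsed into.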

Japanese SenseVoice confusion matrices

Chinese SenseVoice confusion matrices

| Case | What happened | Why it matters |
| --- | --- | --- |
| indextts-2 / ja | happy -> sad 4/10; fear -> sad 5/10; disgust -> angry 10/10. | Emotion labels can look plausible even when the Japanese text quality is unreliable. |
| qwen3_tts_customvoice_1_7b / zh | happy -> neutral 5/10; fear -> sad 9/10; disgust -> neutral 9/10. | Qwen is the balanced winner, but the hard emotions still collapse. |
| cosyvoice_300m_instruct / ja | happy -> unknown 10/10; fear -> unknown 9/10; disgust -> unknown 8/10. | Naturalness does not guarantee recognizable emotional control. |
| fish_audio_s1_mini / zh | happy -> neutral 10/10; fear -> neutral 9/10; disgust -> neutral 8/10. | Inline emotion markers did not reliably shift the generated prosody. |
| voxcpm2 / zh | happy -> neutral 7/10; fear -> neutral 6/10; disgust -> neutral 10/10. | Prompt-driven control often collapsed into neutral speech. |

emotion2vec anchors

emotion2vec anchor hit and margin by language

The anchor metric tells a similar story to SenseVoice: Chinese anchors are more favorable than Japanese anchors. A positive margin means the audio is closer to the target emotion centroid. The Chinese margins for Qwen3-TTS and CosyVoice are positive, while all Japanese margins are negative.
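
One plausible way to compute an anchor hit and margin is sketched below. It is an assumption, not the article's confirmed formula: each emotion centroid is the mean of that emotion's anchor embeddings, similarity is cosine similarity, the margin is the similarity to the target centroid minus the best similarity to any other centroid, and a hit is a positive margin. The two-dimensional vectors are toy values; real emotion2vec embeddings are much higher-dimensional, and all function names are hypothetical.

```python
import math


def centroid(vectors):
    """Element-wise mean of a list of equal-length embeddings."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def anchor_hit_and_margin(embedding, centroids, target):
    """Return (hit, margin) for one generated clip.

    Assumed definition: margin = sim(target) - max sim(other emotions).
    """
    sims = {emotion: cosine(embedding, c) for emotion, c in centroids.items()}
    best_other = max(s for emotion, s in sims.items() if emotion != target)
    margin = sims[target] - best_other
    return margin > 0, margin


# Toy anchor embeddings per emotion (invented values).
anchors = {
    "happy": [[1.0, 0.2], [1.0, -0.2]],
    "sad": [[0.2, 1.0], [-0.2, 1.0]],
}
centroids = {emotion: centroid(vecs) for emotion, vecs in anchors.items()}

hit, margin = anchor_hit_and_margin([0.8, 0.6], centroids, target="happy")
print(hit, round(margin, 3))
```

With this definition, a negative margin means some other emotion's centroid is a better match than the requested one, which is the situation the Japanese margins describe.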

Naturalness

Naturalness diagnostics by model

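| Model | Mean NISQA-TTS | Low NISQA-TTS <3.0 | Mean UTMOS | Low UTMOS <3.0 |
| --- | --- | --- | --- | --- |
| cosyvoice_300m_instruct | 4.267 | 0.0% | 3.282 | 20.8% |
| indextts-2 | 4.063 | 11.7% | 2.078 | 93.3% |
| qwen3_tts_customvoice_1_7b | 4.007 | 0.8% | 2.939 | 51.7% |
| fish_audio_s1_mini | 3.935 | 3.3% | 2.932 | 55.8% |
| voxcpm2 | 3.788 | 8.3% | 2.596 | 76.7% |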

Naturalness and emotion correctness are separate questions. CosyVoice is the naturalness winner but not the emotion-control winner. Qwen3-TTS trails slightly on NISQA-TTS, but its emotion/text/speed trade-off is better.
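
The two columns per naturalness metric in the table (mean score and share of clips below 3.0) can be derived from per-clip scores as sketched here. The scores are invented for illustration, the predictors themselves (NISQA-TTS, UTMOS) are not called, and `naturalness_summary` is a hypothetical helper.

```python
from statistics import mean


def naturalness_summary(scores, threshold=3.0):
    """Mean score and the share of clips scoring below the threshold."""
    low = sum(1 for s in scores if s < threshold)
    return {"mean": mean(scores), "low_share": low / len(scores)}


# Hypothetical per-clip scores for one model.
summary = naturalness_summary([4.2, 3.9, 2.8, 4.1])
print(summary)
```

Reporting the low-score share next to the mean helps because a model can have a good average while still producing a noticeable fraction of poor clips.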

Listening examples

The table below uses the same prompt index for the Japanese and Chinese happy and angry samples. These clips are not a human listening test; they are qualitative anchors for interpreting the automatic metrics.

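| Model | Language | Target | SenseVoice prediction | Sample |
| --- | --- | --- | --- | --- |
| qwen3_tts_customvoice_1_7b | JA | happy | unknown | (audio clip) |
| qwen3_tts_customvoice_1_7b | JA | angry | angry | (audio clip) |
| qwen3_tts_customvoice_1_7b | ZH | happy | neutral | (audio clip) |
| qwen3_tts_customvoice_1_7b | ZH | angry | angry | (audio clip) |
| cosyvoice_300m_instruct | JA | happy | unknown | (audio clip) |
| cosyvoice_300m_instruct | JA | angry | unknown | (audio clip) |
| cosyvoice_300m_instruct | ZH | happy | happy | (audio clip) |
| cosyvoice_300m_instruct | ZH | angry | neutral | (audio clip) |
| indextts-2 | JA | happy | sad | (audio clip) |
| indextts-2 | JA | angry | surprised | (audio clip) |
| indextts-2 | ZH | happy | neutral | (audio clip) |
| indextts-2 | ZH | angry | neutral | (audio clip) |
| fish_audio_s1_mini | JA | happy | happy | (audio clip) |
| fish_audio_s1_mini | JA | angry | happy | (audio clip) |
| fish_audio_s1_mini | ZH | happy | neutral | (audio clip) |
| fish_audio_s1_mini | ZH | angry | neutral | (audio clip) |
| voxcpm2 | JA | happy | unknown | (audio clip) |
| voxcpm2 | JA | angry | angry | (audio clip) |
| voxcpm2 | ZH | happy | happy | (audio clip) |
| voxcpm2 | ZH | angry | angry | (audio clip) |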

Limitations

  • Automatic emotion labels are not human judgments. SenseVoice is useful because it supports Japanese and Chinese, but classifier bias and language imbalance are possible.
  • Anchor metrics depend on the anchor datasets. Japanese anchors came from JVNV and Chinese anchors from CSEMOTIONS; the ja/neutral and zh/disgust anchors were missing in this run.
  • The IndexTTS-2 Japanese results are diagnostic only. The pooled score looks strong, but Japanese CER is very high in this setup.

Further research

  • Run a native-listener MOS/CMOS test for Qwen3-TTS and CosyVoice.
  • Treat IndexTTS-2 as a Chinese-focused candidate for now, or rerun it after fixing the Japanese tokenizer/text path.
  • Add or curate ja/neutral and zh/disgust anchors.
  • Run a focused human check for Chinese sad, angry, fear, and disgust.
  • Keep SenseVoice as the automatic screening metric, but base production decisions on human listening tests.

Conclusion

For Japanese and Chinese emotional TTS, Qwen3-TTS CustomVoice 1.7B is the most balanced model in this benchmark. It does not solve every emotion, but it offers the most practical mix of emotion recognition, low CER, anchor hit rate, naturalness, and runtime.
