Emotional TTS Benchmark for Japanese and Chinese

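Summary

This study evaluates five emotional TTS systems on Japanese and Chinese across six target emotions: neutral, happy, sad, angry, fear, and disgust. Every prompt is emotionally neutral in wording, so the emotion has to be carried by the speaking style rather than implied by the text.

The most balanced candidate overall is Qwen3-TTS CustomVoice 1.7B. Among the models whose Japanese and Chinese text output can be trusted, it has the lowest average CER, the best anchor hit rate averaged across the two languages, stable naturalness, and the most practically useful emotion recognition results.

CosyVoice 300M Instruct has the best naturalness scores but weak emotion recognition. IndexTTS-2's pooled SenseVoice score looks good, but its Japanese CER is too high for the output to count as a reliable Japanese TTS result. Overall, Chinese is easier than Japanese, and fear and disgust remain unsolved.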

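Motivation

Emotional TTS is about more than a natural-sounding voice. A real product needs the model to say the sentence correctly, sound natural enough, and express the requested emotion, all at once. This evaluation therefore looks at discrete emotion recognition, continuous emotion anchors, transcription accuracy, naturalness, speed, and listening samples together. Concretely, a usable system has to meet three requirements:

The target Japanese or Chinese sentence must be spoken correctly.
The speech must be natural enough for real-world listening.
The generated speech should express the requested emotion rather than fall back to a neutral tone or a neighboring emotion.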

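Evaluation Method

The benchmark uses a generation grid balanced across language, emotion, and prompt. Each sentence is synthesized in all six emotions, so a model cannot lean on emotionally loaded text and has to express the emotion through prosody and voice style.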

Experiment design

Prompt Set

Example Japanese prompts:

ID	Sentence
`ja_001`	会議は午前十時に始まります。
`ja_002`	資料は机の上に置いてあります。
`ja_003`	明日の予定を確認してください。
`ja_004`	電車は三番線から出発します。
`ja_005`	受付で名前を伝えてください。

Example Chinese prompts:

ID	Sentence
`zh_001`	会议将在上午十点开始。
`zh_002`	资料已经放在桌子上。
`zh_003`	请确认明天的日程安排。
`zh_004`	列车将从三号站台出发。
`zh_005`	请在前台告知您的姓名。

Emotion Control

Target emotion	Control text
`neutral`	Speak in a clear, neutral, natural voice.
`happy`	Speak in a happy, warm, bright voice.
`sad`	Speak in a sad, soft, slow, gentle voice.
`angry`	Speak in an angry, tense, forceful voice.
`fear`	Speak in a fearful, tense, trembling voice.
`disgust`	Speak in a disgusted, displeased, rejecting voice.
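The generation grid described above can be sketched as a Cartesian product of models, languages, prompts, and emotions. This is a minimal illustration, not the benchmark's actual harness: the function name `build_grid` and the job field names are hypothetical, only two example prompts per language are included, and the way each model consumes the control text (instruction, inline tag, or prompt) is not modeled.

```python
from itertools import product

# Control texts copied verbatim from the emotion control table.
CONTROL_TEXT = {
    "neutral": "Speak in a clear, neutral, natural voice.",
    "happy": "Speak in a happy, warm, bright voice.",
    "sad": "Speak in a sad, soft, slow, gentle voice.",
    "angry": "Speak in an angry, tense, forceful voice.",
    "fear": "Speak in a fearful, tense, trembling voice.",
    "disgust": "Speak in a disgusted, displeased, rejecting voice.",
}

# Model identifiers as they appear in the result tables.
MODELS = [
    "cosyvoice_300m_instruct",
    "qwen3_tts_customvoice_1_7b",
    "fish_audio_s1_mini",
    "indextts-2",
    "voxcpm2",
]

# A subset of the example prompts; the full prompt set is not reproduced here.
PROMPTS = {
    "ja": {
        "ja_001": "会議は午前十時に始まります。",
        "ja_002": "資料は机の上に置いてあります。",
    },
    "zh": {
        "zh_001": "会议将在上午十点开始。",
        "zh_002": "资料已经放在桌子上。",
    },
}


def build_grid(models, prompts, control_text):
    """Return one job per (model, language, prompt, emotion) combination."""
    jobs = []
    for model, lang in product(models, prompts):
        for (prompt_id, text), (emotion, control) in product(
            prompts[lang].items(), control_text.items()
        ):
            jobs.append(
                {
                    "model": model,
                    "lang": lang,
                    "prompt_id": prompt_id,
                    "text": text,        # identical across all six emotions
                    "emotion": emotion,
                    "control": control,  # only this field varies with the emotion
                }
            )
    return jobs


grid = build_grid(MODELS, PROMPTS, CONTROL_TEXT)
print(len(grid))  # 5 models x 2 languages x 2 prompts x 6 emotions = 120
```

With ten prompts per language, the same function would yield 5 x 2 x 10 x 6 = 600 jobs, which matches the sample count reported in the results. The figure of ten prompts per language is inferred from the n/10 counts in the confusion summaries; it is not stated explicitly.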

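Evaluation Metrics

SenseVoice emotion accuracy: the primary automatic screening metric.
emotion2vec anchor hit and margin: auxiliary diagnostic metrics computed against the centroids of emotional speech anchors.
CER: character error rate, obtained by comparing the transcription with the original prompt text.
NISQA-TTS and UTMOS: diagnostic metrics for the naturalness and quality of the synthesized speech.
RTF: real-time factor, a measure of synthesis speed (synthesis time divided by the duration of the generated audio; lower is faster).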
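Two of these metrics are simple enough to sketch directly. The code below is an illustrative implementation under stated assumptions, not the benchmark's own scoring code: CER is computed as character-level Levenshtein distance divided by reference length, with no text normalization (the article does not say whether punctuation or whitespace is stripped before comparison), and RTF is taken as synthesis time over audio duration. The function names are hypothetical.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Character-level Levenshtein distance using a rolling row."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution_cost = 0 if ref_char == hyp_char else 1
            current.append(
                min(
                    previous[j] + 1,                      # deletion
                    current[j - 1] + 1,                   # insertion
                    previous[j - 1] + substitution_cost,  # match or substitution
                )
            )
        previous = current
    return previous[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate; can exceed 1.0 when the hypothesis is much longer."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def rtf(synthesis_seconds: float, audio_seconds: float) -> float:
    """Real-time factor; values below 1.0 mean faster than real time."""
    return synthesis_seconds / audio_seconds


# One substituted character (前 -> 後) in a 14-character reference.
reference = "会議は午前十時に始まります。"
hypothesis = "会議は午後十時に始まります。"
print(f"CER = {cer(reference, hypothesis):.3f}")
print(f"RTF = {rtf(2.0, 4.0):.2f}")
```

In practice the hypothesis would come from an ASR transcription of the generated audio, so the reported CER also includes any errors made by the recognizer itself.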

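Results

Resource Usage

Resource metrics come from the 600 successfully generated samples. GPU utilization, VRAM, wall time, and RTF were recorded for every completed row; for server-based adapters, CPU usage could not always be captured from the sampled process.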

Model	Median wall time	Median RTF	Median peak VRAM	GPU util	GPU power	CPU	Median peak RSS
`cosyvoice_300m_instruct`	2.26s	0.85	3.96 GB	30.3% avg / 39.0% peak	145.0W avg / 155.6W peak	127.8% peak; 100% coverage	5.54 GB
`qwen3_tts_customvoice_1_7b`	4.20s	1.58	8.13 GB	22.9% avg / 25.0% peak	126.3W avg / 127.1W peak	138.1% peak; 100% coverage	6.22 GB
`fish_audio_s1_mini`	7.06s	3.47	13.05 GB	25.3% avg / 69.0% peak	150.4W avg / 183.7W peak	not captured; 0% coverage	0.80 GB
`indextts-2`	26.39s	6.97	7.29 GB	18.2% avg / 100.0% peak	131.3W avg / 199.6W peak	not captured; 0% coverage	7.69 GB
`voxcpm2`	28.44s	9.84	12.79 GB	12.3% avg / 100.0% peak	106.7W avg / 191.5W peak	not captured; 0% coverage	10.65 GB

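CosyVoice is the fastest model and uses the least VRAM, but it does not have the strongest emotion control. Qwen3-TTS needs more VRAM than CosyVoice (8.13 GB vs. 3.96 GB), yet it is far faster than IndexTTS-2 and VoxCPM2 and strikes the best balance between emotion recognition and text fidelity.

JA/ZH Metrics Overview

The table below splits the three core automatic checks by Japanese and Chinese: SenseVoice emotion accuracy, CER text fidelity, and emotion2vec anchor alignment.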

Model	JA SenseVoice	ZH SenseVoice	JA CER	ZH CER	JA anchor hit	ZH anchor hit	JA anchor margin	ZH anchor margin
`qwen3_tts_customvoice_1_7b`	15.0%	53.3%	8.6%	9.7%	40.0%	64.0%	-0.06645	0.04480
`indextts-2`	43.3%	16.7%	91.0%	10.3%	38.0%	30.0%	-0.08293	-0.04063
`voxcpm2`	6.7%	35.0%	18.6%	4.4%	40.0%	36.0%	-0.04479	-0.02693
`cosyvoice_300m_instruct`	1.7%	36.7%	43.9%	11.1%	24.0%	72.0%	-0.05481	0.03796
`fish_audio_s1_mini`	6.7%	16.7%	12.7%	16.8%	20.0%	24.0%	-0.08972	-0.09542

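Chinese is generally easier on the automatic emotion metrics, but CER and emotion accuracy do not always move together. Qwen3-TTS keeps CER low in both languages; IndexTTS-2 has the highest Japanese SenseVoice score (43.3%) but also the worst Japanese CER (91.0%).

Text Fidelity (CER)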

CER by language

On text fidelity, Qwen3-TTS is the most consistent across JA and ZH: 8.6% CER in Japanese and 9.7% in Chinese. IndexTTS-2 is the cautionary case, with a Japanese CER of 91.0%.

Emotion Accuracy

SenseVoice

SenseVoice accuracy by language

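In this automatic setup, Chinese is clearly easier than Japanese. Qwen3-TTS reaches 53.3% SenseVoice accuracy in Chinese but only 15.0% in Japanese. CER is low in both languages, which suggests the gap is not just an intelligibility problem: the Japanese emotional cues are either weaker or do not match what SenseVoice looks for.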

Per-emotion SenseVoice recall by model and language

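fear and disgust are the hardest labels. For these two emotions, SenseVoice recall is 0.0% for every model and language combination; the samples are usually classified as sad, neutral, angry, or unknown.

Rows are target emotions and columns are SenseVoice predictions. Green boxes mark the ideal diagonal.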
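The row/column layout and the per-emotion recall figures can be reproduced from a list of (target, predicted) label pairs. The sketch below is only an illustration: the helper names are hypothetical, and the sample pairs are toy data chosen for the example, not results from the benchmark.

```python
from collections import Counter, defaultdict


def confusion_counts(pairs):
    """Build {target: Counter({predicted: count})}; rows are targets."""
    matrix = defaultdict(Counter)
    for target, predicted in pairs:
        matrix[target][predicted] += 1
    return matrix


def recall_per_emotion(matrix):
    """Share of each target row that lands on the diagonal."""
    return {target: row[target] / sum(row.values()) for target, row in matrix.items()}


def accuracy(matrix):
    """Overall share of samples whose prediction equals the target."""
    total = sum(sum(row.values()) for row in matrix.values())
    correct = sum(row[target] for target, row in matrix.items())
    return correct / total


# Toy data: five "happy" targets and five "fear" targets.
pairs = (
    [("happy", "happy")] * 3
    + [("happy", "neutral")] * 2
    + [("fear", "sad")] * 4
    + [("fear", "neutral")] * 1
)

matrix = confusion_counts(pairs)
print(recall_per_emotion(matrix))
print(accuracy(matrix))
```

A target whose samples never reach the diagonal, like `fear` in the toy data, gets a recall of 0.0 even though the classifier still assigns every sample some label.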

Japanese SenseVoice confusion matrices

Chinese SenseVoice confusion matrices

Case	Observation	Why it matters
`indextts-2 / ja`	`happy` -> `sad` 4/10; `fear` -> `sad` 5/10; `disgust` -> `angry` 10/10.	Emotion labels can still look plausible even when the Japanese text quality is unreliable.
`qwen3_tts_customvoice_1_7b / zh`	`happy` -> `neutral` 5/10; `fear` -> `sad` 9/10; `disgust` -> `neutral` 9/10.	Qwen is the most balanced candidate, but the hard emotions still collapse.
`cosyvoice_300m_instruct / ja`	`happy` -> `unknown` 10/10; `fear` -> `unknown` 9/10; `disgust` -> `unknown` 8/10.	High naturalness does not mean the emotion control is reliably recognizable.
`fish_audio_s1_mini / zh`	`happy` -> `neutral` 10/10; `fear` -> `neutral` 9/10; `disgust` -> `neutral` 8/10.	The inline emotion tags did not reliably change the prosody of the generated speech.
`voxcpm2 / zh`	`happy` -> `neutral` 7/10; `fear` -> `neutral` 6/10; `disgust` -> `neutral` 10/10.	Prompt-driven control often falls back to neutral speech.

emotion2vec Anchor

emotion2vec anchor hit and margin by language

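The anchor metrics show a trend similar to SenseVoice: Chinese anchors are more favorable than Japanese ones. A positive margin means the generated audio is closer to the target emotion centroid than to the other emotion centroids. Qwen3-TTS and CosyVoice have positive margins in Chinese (0.04480 and 0.03796), but every Japanese margin, for every model, is negative.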
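One way to turn that description into numbers is sketched below. The exact formula used in the benchmark is not given, so this is an assumption: similarity is cosine similarity between an utterance embedding and each emotion centroid, the margin is the target similarity minus the best competing similarity, and a hit is a positive margin. The function names and the 2-D toy vectors are hypothetical; real emotion2vec embeddings have far more dimensions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def centroid(vectors):
    """Element-wise mean of a list of embeddings (one anchor centroid)."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def anchor_score(embedding, centroids, target):
    """Return (hit, margin) for one generated utterance.

    Assumed definition: margin = sim(target) - max sim(other emotions).
    """
    sims = {emotion: cosine(embedding, c) for emotion, c in centroids.items()}
    best_other = max(sim for emotion, sim in sims.items() if emotion != target)
    margin = sims[target] - best_other
    return margin > 0, margin


# Toy centroids; in the benchmark these would be built from anchor recordings.
centroids = {
    "happy": centroid([[1.0, 0.1], [1.0, -0.1]]),
    "sad": centroid([[0.1, 1.0], [-0.1, 1.0]]),
    "angry": centroid([[-1.0, 0.0], [-1.0, 0.0]]),
}

embedding = [0.9, 0.1]
print(anchor_score(embedding, centroids, "happy"))
print(anchor_score(embedding, centroids, "sad"))
```

Under this definition a model can have a negative mean margin while still scoring a nonzero hit rate, because the mean is dominated by how far the misses fall from the target centroid.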

Naturalness

Naturalness diagnostics by model

Model	Mean NISQA-TTS	Low NISQA-TTS <3.0	Mean UTMOS	Low UTMOS <3.0
`cosyvoice_300m_instruct`	4.267	0.0%	3.282	20.8%
`indextts-2`	4.063	11.7%	2.078	93.3%
`qwen3_tts_customvoice_1_7b`	4.007	0.8%	2.939	51.7%
`fish_audio_s1_mini`	3.935	3.3%	2.932	55.8%
`voxcpm2`	3.788	8.3%	2.596	76.7%

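Naturalness and emotional correctness are separate problems. CosyVoice is strongest on naturalness but not on emotion control. Qwen3-TTS scores somewhat lower than CosyVoice on NISQA-TTS (4.007 vs. 4.267), but offers a better overall trade-off across emotion, text accuracy, and speed.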
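The two kinds of column in the naturalness table, a mean score and the share of samples below 3.0, can be computed as in the sketch below. The function name is hypothetical, the scores are toy values rather than benchmark data, and treating the threshold as strictly "less than 3.0" follows the column headers.

```python
from statistics import mean


def summarize_scores(scores, low_threshold=3.0):
    """Return the mean score and the share of samples strictly below the threshold."""
    low_count = sum(score < low_threshold for score in scores)
    return {"mean": mean(scores), "low_share": low_count / len(scores)}


# Toy per-sample scores for one model.
summary = summarize_scores([4.0, 2.5, 3.0, 3.5])
print(summary)
```

Reporting both numbers is useful because a model can have an acceptable mean while a large share of individual samples still falls under the threshold.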

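Listening Examples

The table below uses the same prompt index to pick happy and angry examples in Japanese and Chinese. These clips are not a human listening test; they are qualitative reference points for interpreting the automatic metrics.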

Model	Language	Target	SenseVoice prediction
`qwen3_tts_customvoice_1_7b`	JA	happy	unknown
`qwen3_tts_customvoice_1_7b`	JA	angry	angry
`qwen3_tts_customvoice_1_7b`	ZH	happy	neutral
`qwen3_tts_customvoice_1_7b`	ZH	angry	angry
`cosyvoice_300m_instruct`	JA	happy	unknown
`cosyvoice_300m_instruct`	JA	angry	unknown
`cosyvoice_300m_instruct`	ZH	happy	happy
`cosyvoice_300m_instruct`	ZH	angry	neutral
`indextts-2`	JA	happy	sad
`indextts-2`	JA	angry	surprised
`indextts-2`	ZH	happy	neutral
`indextts-2`	ZH	angry	neutral
`fish_audio_s1_mini`	JA	happy	happy
`fish_audio_s1_mini`	JA	angry	happy
`fish_audio_s1_mini`	ZH	happy	neutral
`fish_audio_s1_mini`	ZH	angry	neutral
`voxcpm2`	JA	happy	unknown
`voxcpm2`	JA	angry	angry
`voxcpm2`	ZH	happy	happy
`voxcpm2`	ZH	angry	angry

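Limitations

Automatic emotion labels are not human judgments. SenseVoice supports Japanese and Chinese and maps easily onto the benchmark labels, but it may still carry classifier bias and language imbalance.
Anchor metrics depend on the anchor datasets. The Japanese anchors come from JVNV and the Chinese anchors from CSEMOTIONS; this run lacks the ja/neutral and zh/disgust anchors.
The IndexTTS-2 Japanese results are diagnostic only. The pooled emotion score looks strong, but the Japanese CER is too high in this setup.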

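Future Work

Run a small MOS/CMOS test with native listeners on Qwen3-TTS and CosyVoice, rating naturalness, emotional correctness, and text intelligibility separately.
For now, treat IndexTTS-2 as a Chinese-leaning model, or rerun it after fixing the Japanese tokenizer/text path.
Fill in or curate the missing ja/neutral and zh/disgust emotion anchors.
Run a focused human evaluation of Chinese sad, angry, fear, and disgust.
Keep SenseVoice as the automatic screening metric, but confirm product decisions with human listening tests.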

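Conclusion

For emotional TTS in Japanese and Chinese, Qwen3-TTS CustomVoice 1.7B is the most balanced model in this benchmark. It does not yet solve every emotion, but it offers the most practical combination of emotion recognition, low CER, anchor hit rate, naturalness, and speed.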
